DODOCO Framework Tests MoE Dispatch Overhead Assumptions
A new framework, DODOCO, tests two assumptions behind mitigations for AlltoAll dispatch overhead in Mixture-of-Experts (MoE) expert parallelism: that routing imbalance can be corrected at the system layer, and that mock-token benchmarks represent production routing. It evaluates five MoE checkpoints (DeepSeek-V2-Lite MLA, DeepSeek-MoE-16B MHA, Qwen3-30B GQA, Nemotron-30B Mamba-2, and Qwen3.5-35B GDN) across a 5 by 6 grid of data conditions, scanning expert parallelism (EP) from 4 to 32 ranks on H100s. Both assumptions fail: scaling EP changes the per-expert max/mean token ratio by at most 5%, which indicates that the straggler is intrinsic to the routing rather than something the system layer can correct.
Key facts
- AlltoAll dispatch is the dominant bottleneck of MoE expert parallelism.
- Four families of mitigations exist: predictive sample placement, adaptive expert relayout, hierarchical collectives, and EP-aware topology.
- DODOCO tests two assumptions: that routing imbalance is correctable at the system layer, and that mock-token benchmarks represent production routing.
- Five MoE checkpoints tested: DeepSeek-V2-Lite MLA, DeepSeek-MoE-16B MHA, Qwen3-30B GQA, Nemotron-30B Mamba-2, Qwen3.5-35B GDN.
- Experiments used a 5 by 6 grid of data conditions and an EP scan from 4 to 32 ranks on H100s.
- Both assumptions fail; scaling EP changes per-expert max/mean token ratio by at most 5%.
- The straggler is intrinsic, not correctable by the system layer.
- The paper is on arXiv with ID 2605.20982.
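
To make the max/mean token ratio in the key facts concrete, here is a minimal Python sketch of how such a load-imbalance metric can be computed from routing assignments. It is an illustration only: the function name `max_mean_token_ratio` and the toy inputs are assumptions made here, and the summary does not say how the paper aggregates the ratio across layers, ranks, or batches.

```python
from collections import Counter


def max_mean_token_ratio(expert_ids, num_experts):
    """Return max/mean of per-expert token counts (1.0 means perfectly balanced).

    expert_ids: iterable of expert indices, one per routed token (hypothetical input).
    num_experts: total number of experts, including any that received no tokens.
    """
    if num_experts <= 0:
        raise ValueError("num_experts must be positive")
    counts = Counter(expert_ids)
    # Experts that received no tokens still count toward the mean.
    per_expert = [counts.get(e, 0) for e in range(num_experts)]
    total = sum(per_expert)
    if total == 0:
        raise ValueError("no tokens were routed to experts in range")
    mean = total / num_experts
    return max(per_expert) / mean


# Toy routing traces, not data from the paper.
balanced = [0, 1, 2, 3] * 4           # every expert receives 4 tokens
skewed = [0] * 10 + [1, 2, 3] * 2     # expert 0 receives 10 tokens, the others 2 each

print(max_mean_token_ratio(balanced, 4))
print(max_mean_token_ratio(skewed, 4))
```

The busiest expert sets the step time in an AlltoAll exchange, so a ratio well above 1.0 means the other ranks wait on a straggler. The reported result is that this ratio moves by at most 5% as EP scales from 4 to 32 ranks.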
Entities
Institutions
- arXiv