ARTFEED — Contemporary Art Intelligence

DODOCO Framework Tests MoE Dispatch Overhead Assumptions

publication · 2026-05-22

A new framework, DODOCO, challenges two assumptions behind existing mitigations for the AlltoAll dispatch bottleneck in Mixture-of-Experts (MoE) expert parallelism: that routing imbalance can be corrected at the system layer, and that mock-token benchmarks represent production routing. DODOCO tested five MoE checkpoints (DeepSeek-V2-Lite MLA, DeepSeek-MoE-16B MHA, Qwen3-30B GQA, Nemotron-30B Mamba-2, and Qwen3.5-35B GDN) across a 5 by 6 grid of data conditions, scanning expert parallelism (EP) from 4 to 32 ranks on H100s. Both assumptions failed: scaling EP changed the per-expert max/mean token ratio by at most 5%, indicating that the straggler is intrinsic rather than something the system layer can correct.

Key facts

  • AlltoAll dispatch is the dominant bottleneck of MoE expert parallelism.
  • Four families of mitigations exist: predictive sample placement, adaptive expert relayout, hierarchical collectives, and EP-aware topology.
  • DODOCO tests two assumptions: routing imbalance is correctable, and mock-token benchmarks represent production routing.
  • Five MoE checkpoints tested: DeepSeek-V2-Lite MLA, DeepSeek-MoE-16B MHA, Qwen3-30B GQA, Nemotron-30B Mamba-2, Qwen3.5-35B GDN.
  • Experiments used a 5 by 6 grid of data conditions and an expert parallelism (EP) scan from 4 to 32 ranks on H100s.
  • Both assumptions fail; scaling EP changes per-expert max/mean token ratio by at most 5%.
  • The straggler is intrinsic, not correctable by the system layer.
  • The paper is on arXiv with ID 2605.20982.
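
The max/mean token ratio cited above can be illustrated with a minimal Python sketch. This is not DODOCO's code: the function names, the contiguous expert-to-rank sharding, and the toy routing trace are all assumptions made for illustration. A ratio of 1.0 means perfectly balanced load; larger values mean the busiest expert (or rank) receives that many times the average and becomes the straggler.

```python
from collections import Counter


def max_mean_ratio(expert_ids, num_experts):
    """Max/mean token ratio across experts (hypothetical helper).

    expert_ids: one expert index per routed token.
    """
    counts = Counter(expert_ids)
    # Include experts that received no tokens so the mean is over all experts.
    loads = [counts.get(e, 0) for e in range(num_experts)]
    mean = sum(loads) / num_experts
    if mean == 0:
        raise ValueError("no tokens were routed")
    return max(loads) / mean


def rank_max_mean_ratio(expert_ids, num_experts, ep_ranks):
    """Max/mean token ratio across EP ranks.

    Assumes experts are sharded contiguously and evenly across ranks,
    which is an illustrative choice, not a layout taken from the paper.
    """
    if num_experts % ep_ranks != 0:
        raise ValueError("num_experts must be divisible by ep_ranks")
    experts_per_rank = num_experts // ep_ranks
    loads = [0] * ep_ranks
    for e in expert_ids:
        loads[e // experts_per_rank] += 1
    mean = sum(loads) / ep_ranks
    if mean == 0:
        raise ValueError("no tokens were routed")
    return max(loads) / mean


# Toy routing trace (made up): 8 tokens routed over 4 experts.
trace = [0, 0, 0, 0, 1, 2, 3, 3]
print(max_mean_ratio(trace, num_experts=4))                   # 2.0
print(rank_max_mean_ratio(trace, num_experts=4, ep_ranks=2))  # 1.25
```

In this toy trace, expert 0 receives 4 of 8 tokens against a mean of 2, so the per-expert ratio is 2.0. Grouping experts onto two ranks gives loads of 5 and 3 against a mean of 4, so the per-rank ratio is 1.25.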

Entities

Institutions

  • arXiv
