MoE Inference Bottleneck: Expert Activation Patterns in Llama 4, DeepSeek V3, Qwen3
A new arXiv preprint (2604.23150) identifies expert load imbalance and inefficient token routing as fundamental bottlenecks in multi-node Mixture-of-Experts (MoE) inference for large language models. The authors profiled state-of-the-art open-source MoE models, namely Llama 4 Maverick, DeepSeek V3-671B, and Qwen3-235B-A22B, on a range of datasets, collecting over 100,000 real expert activation traces. Across all three models they found persistent properties: variable expert load imbalance, domain-specific expert activation, and significant inter-node all-to-all communication overhead whenever tokens are routed to experts that are not hosted locally. The study systematically characterizes these challenges to inform future optimization strategies for scalable MoE serving.
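The following is a minimal sketch, not taken from the preprint, of one way such activation traces can be turned into a load-imbalance figure: count how many tokens each expert receives in a layer and report the max-to-mean load ratio, where 1.0 would mean perfectly balanced experts. The function name, the metric, and the skewed toy distribution are illustrative assumptions.

```python
import numpy as np

def load_imbalance(expert_ids: np.ndarray, num_experts: int) -> float:
    """expert_ids: 1-D array of expert indices chosen for the (token, slot)
    pairs of one MoE layer. Returns max expert load divided by mean load
    (1.0 would mean perfectly balanced experts)."""
    counts = np.bincount(expert_ids, minlength=num_experts)
    return float(counts.max() / counts.mean())

# Toy trace: 8 experts, top-2 routing over 1,000 tokens, skewed toward expert 0.
rng = np.random.default_rng(0)
skewed_probs = np.array([0.30, 0.15, 0.12, 0.11, 0.10, 0.09, 0.07, 0.06])
trace = rng.choice(8, size=2 * 1000, p=skewed_probs)
print(f"max/mean expert load: {load_imbalance(trace, 8):.2f}")  # > 1.0 indicates imbalance
```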
Key facts
- arXiv:2604.23150
- MoE inference bottlenecked by expert load imbalance and inefficient token routing
- Multi-node deployments suffer inter-node all-to-all communication overhead (see the routing sketch after this list)
- Profiled Llama 4 Maverick, DeepSeek V3-671B, Qwen3-235B-A22B
- Collected over 100,000 real expert activation traces
- Uncovered variable expert load imbalance across all models
- Domain-specific expert activation patterns observed
- Study aims to inform optimization for scalable MoE serving
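To make the communication point concrete, here is a hedged sketch of how the share of inter-node dispatches could be estimated from routing decisions and a static expert-to-node placement. The function, the 8-node layout, and the uniform toy routing are assumptions for illustration, not the paper's measurement setup.

```python
import numpy as np

def cross_node_fraction(top_k_experts: np.ndarray,
                        token_node: np.ndarray,
                        expert_node: np.ndarray) -> float:
    """top_k_experts: (tokens, k) expert indices picked by the router.
    token_node:     (tokens,)   node holding each token's activations.
    expert_node:    (experts,)  node hosting each expert's weights.
    Returns the fraction of dispatches whose destination is a remote node."""
    dest_node = expert_node[top_k_experts]          # (tokens, k) destination nodes
    remote = dest_node != token_node[:, None]       # True where the dispatch crosses nodes
    return float(remote.mean())

# Toy layout: 64 experts spread evenly over 8 nodes, 4,096 tokens, top-2 routing.
rng = np.random.default_rng(1)
expert_node = np.repeat(np.arange(8), 8)            # experts 0-7 on node 0, 8-15 on node 1, ...
token_node = rng.integers(0, 8, size=4096)
routes = rng.integers(0, 64, size=(4096, 2))        # uniform routing, purely illustrative
print(f"remote dispatch fraction: {cross_node_fraction(routes, token_node, expert_node):.2f}")
```

Under uniform routing across 8 nodes, roughly 7 out of 8 dispatches leave the local node, which illustrates why placing frequently co-activated experts near the tokens that use them is a natural optimization target for MoE serving.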