MoE Inference Bottleneck: Expert Activation Patterns in Llama 4, DeepSeek V3, Qwen3
A new arXiv preprint (2604.23150) identifies expert load imbalance and inefficient token routing as fundamental bottlenecks in multi-node Mixture-of-Experts (MoE) inference for large language models. The authors profiled state-of-the-art open-source MoE models, namely Llama 4 Maverick, DeepSeek V3-671B, and Qwen3-235B-A22B, on a range of datasets, collecting over 100,000 real expert activation traces. Across all three models they found persistent properties: variable expert load imbalance, domain-specific expert activation, and significant inter-node all-to-all communication overhead whenever tokens are routed to experts that are not hosted locally. The study systematically characterizes these challenges to inform future optimization strategies for scalable MoE serving.
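The following is a minimal sketch, not taken from the preprint, of one way such activation traces can be turned into a load-imbalance figure: count how many tokens each expert receives in a layer and report the max-to-mean load ratio, where 1.0 would mean perfectly balanced experts. The function name, the metric, and the skewed toy distribution are illustrative assumptions.

```python
import numpy as np

def load_imbalance(expert_ids: np.ndarray, num_experts: int) -> float:
    """expert_ids: 1-D array of expert indices chosen for the (token, slot)
    pairs of one MoE layer. Returns max expert load divided by mean load
    (1.0 would mean perfectly balanced experts)."""
    counts = np.bincount(expert_ids, minlength=num_experts)
    return float(counts.max() / counts.mean())

# Toy trace: 8 experts, top-2 routing over 1,000 tokens, skewed toward expert 0.
rng = np.random.default_rng(0)
skewed_probs = np.array([0.30, 0.15, 0.12, 0.11, 0.10, 0.09, 0.07, 0.06])
trace = rng.choice(8, size=2 * 1000, p=skewed_probs)
print(f"max/mean expert load: {load_imbalance(trace, 8):.2f}")  # > 1.0 indicates imbalance
```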
Key facts
- arXiv:2604.23150
- MoE inference bottlenecked by expert load imbalance and inefficient token routing
- Multi-node deployments suffer inter-node all-to-all communication overhead (see the routing sketch after this list)
- Profiled Llama 4 Maverick, DeepSeek V3-671B, Qwen3-235B-A22B
- Collected over 100,000 real expert activation traces
- Uncovered variable expert load imbalance across all models
- Domain-specific expert activation patterns observed
- Study aims to inform optimization for scalable MoE serving
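To make the communication point concrete, here is a hedged sketch of how the share of inter-node dispatches could be estimated from routing decisions and a static expert-to-node placement. The function, the 8-node layout, and the uniform toy routing are assumptions for illustration, not the paper's measurement setup.

```python
import numpy as np

def cross_node_fraction(top_k_experts: np.ndarray,
                        token_node: np.ndarray,
                        expert_node: np.ndarray) -> float:
    """top_k_experts: (tokens, k) expert indices picked by the router.
    token_node:     (tokens,)   node holding each token's activations.
    expert_node:    (experts,)  node hosting each expert's weights.
    Returns the fraction of dispatches whose destination is a remote node."""
    dest_node = expert_node[top_k_experts]          # (tokens, k) destination nodes
    remote = dest_node != token_node[:, None]       # True where the dispatch crosses nodes
    return float(remote.mean())

# Toy layout: 64 experts spread evenly over 8 nodes, 4,096 tokens, top-2 routing.
rng = np.random.default_rng(1)
expert_node = np.repeat(np.arange(8), 8)            # experts 0-7 on node 0, 8-15 on node 1, ...
token_node = rng.integers(0, 8, size=4096)
routes = rng.integers(0, 64, size=(4096, 2))        # uniform routing, purely illustrative
print(f"remote dispatch fraction: {cross_node_fraction(routes, token_node, expert_node):.2f}")
```

Under uniform routing across 8 nodes, roughly 7 out of 8 dispatches leave the local node, which illustrates why placing frequently co-activated experts near the tokens that use them is a natural optimization target for MoE serving.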