ARTFEED — Contemporary Art Intelligence

MoE Inference Bottleneck: Expert Activation Patterns in Llama 4, DeepSeek V3, Qwen3

ai-technology · 2026-04-29

A new arXiv preprint (2604.23150) identifies expert load imbalance and inefficient token routing as fundamental bottlenecks in multi-node Mixture-of-Experts (MoE) inference for large language models. The authors profiled state-of-the-art open-source MoE models (Llama 4 Maverick, DeepSeek V3-671B, and Qwen3-235B-A22B) across multiple workload datasets, collecting over 100,000 real expert activation traces. The traces reveal patterns that persist across all three frontier models: variable expert load imbalance, domain-specific expert activation, and significant inter-node all-to-all communication overhead whenever tokens are routed to experts hosted on remote nodes. The study systematically characterizes these bottlenecks to inform future optimization strategies for scalable MoE serving.
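To make "expert load imbalance" concrete: in top-k routing, each token activates k of the model's experts, and imbalance means some experts receive far more tokens than the average. The following is a minimal, hypothetical sketch (not the paper's methodology; expert counts, skew, and the max-over-mean metric are illustrative assumptions) of how a load-imbalance ratio could be computed from routing traces:

```python
# Hypothetical sketch of measuring expert load imbalance from routing
# decisions. All parameters (64 experts, top-8, Zipf-like skew) are
# illustrative assumptions, not figures from the preprint.
import random
from collections import Counter

def route_tokens(num_tokens, num_experts, top_k, skew=2.0, seed=0):
    """Simulate top-k routing with a skewed (Zipf-like) expert preference,
    mimicking a 'hot expert' regime; returns per-expert token counts."""
    rng = random.Random(seed)
    weights = [1.0 / (i + 1) ** skew for i in range(num_experts)]
    loads = Counter()
    for _ in range(num_tokens):
        chosen = rng.choices(range(num_experts), weights=weights, k=top_k)
        loads.update(set(chosen))  # dedupe repeated draws of one expert
    return loads

def imbalance_ratio(loads, num_experts, num_tokens, top_k):
    """Max load over mean load: 1.0 would mean perfectly balanced experts."""
    mean_load = num_tokens * top_k / num_experts
    return max(loads.values()) / mean_load

loads = route_tokens(num_tokens=10_000, num_experts=64, top_k=8)
print(f"imbalance ratio: {imbalance_ratio(loads, 64, 10_000, 8):.2f}")
```

Under a skewed routing distribution the ratio lands well above 1.0, which is the regime the study flags: over-subscribed experts become the serving bottleneck while cold experts sit idle.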

Key facts

  • arXiv:2604.23150
  • MoE inference bottlenecked by expert load imbalance and inefficient token routing
  • Multi-node deployments suffer inter-node all-to-all communication overhead
  • Profiled Llama 4 Maverick, DeepSeek V3-671B, Qwen3-235B-A22B
  • Collected over 100,000 real expert activation traces
  • Uncovered variable expert load imbalance across all models
  • Domain-specific expert activation patterns observed
  • Study aims to inform optimization for scalable MoE serving
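The inter-node overhead in the facts above comes from shipping hidden states to remote experts and their outputs back. A back-of-envelope sketch (all numbers are illustrative assumptions, not measurements from the paper) shows why this traffic dominates when few expert hits stay local:

```python
# Hypothetical estimate of inter-node all-to-all traffic per MoE layer.
# Every expert hit on a remote node ships the token's hidden state out
# and the expert's output back (hence the factor of 2).

def all_to_all_bytes(num_tokens, top_k, hidden_dim, dtype_bytes,
                     remote_fraction):
    """Bytes crossing the node interconnect for one MoE layer."""
    remote_hits = num_tokens * top_k * remote_fraction
    return 2 * remote_hits * hidden_dim * dtype_bytes

# Illustrative batch: 4096 tokens, top-8 routing, 7168-dim hidden
# states in bf16 (2 bytes), 75% of expert hits landing on remote nodes.
traffic = all_to_all_bytes(4096, 8, 7168, 2, 0.75)
print(f"{traffic / 2**30:.2f} GiB per MoE layer")  # → 0.66 GiB per MoE layer
```

Multiplied across dozens of MoE layers per forward pass, this is the communication volume that locality-aware routing and expert placement aim to shrink.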

Entities

Institutions

  • arXiv