
DisagMoE: Overlapping Computation and Communication in MoE Training

ai-technology · 2026-05-13

DisagMoE is a novel system for training Mixture-of-Experts (MoE) models, the architecture used to scale large language models (LLMs) to trillions of parameters. MoE layers rely on sparsely activated experts, and expert parallelism (EP) is the common strategy for training them. However, EP suffers from all-to-all communication bottlenecks, especially as model size grows and experts must be distributed across GPU nodes with limited inter-node bandwidth. Prior work attempted to overlap this all-to-all communication with feed-forward network (FFN) and self-attention computation, but residual network-bound stalls remain because the computation-to-communication ratio is imbalanced. DisagMoE addresses this by disaggregating attention and FFN layers into disjoint GPU groups and introducing a multi-stage pipeline with uni-directional, many-to-many communications. The system jointly optimizes model placement and scheduling for maximal efficiency. The paper is available on arXiv under ID 2605.11005.
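
To make the disaggregation and pipelining idea concrete, below is a minimal single-process sketch that emulates the two disjoint groups with Python threads and a one-way queue: the attention group streams micro-batches downstream while the FFN group consumes them, so each group's computation overlaps the transfer of the next micro-batch. The group functions, the queue transport, and the NumPy stand-ins for the actual kernels are illustrative assumptions, not DisagMoE's interfaces; a real deployment would run GPU kernels and many-to-many collectives between the groups.

```python
import queue
import threading

import numpy as np

NUM_MICROBATCHES = 4
HIDDEN = 64

attn_to_ffn = queue.Queue()   # uni-directional link: attention group -> FFN group
ffn_out = queue.Queue()       # finished micro-batches from the FFN group


def attention_group():
    """Runs only attention; ships each result downstream as soon as it is ready."""
    rng = np.random.default_rng(0)
    for step in range(NUM_MICROBATCHES):
        x = rng.standard_normal((8, HIDDEN))   # stand-in for attention output
        attn_to_ffn.put((step, x))             # send overlaps the next step's compute


def ffn_group():
    """Runs only the (expert) FFN on whatever arrives from the attention group."""
    rng = np.random.default_rng(1)
    w = rng.standard_normal((HIDDEN, HIDDEN))
    for _ in range(NUM_MICROBATCHES):
        step, x = attn_to_ffn.get()            # receive from the upstream group
        y = np.maximum(x @ w, 0.0)             # stand-in for expert FFN compute
        ffn_out.put((step, y))


threads = [threading.Thread(target=attention_group),
           threading.Thread(target=ffn_group)]
for t in threads:
    t.start()
for t in threads:
    t.join()

done = [ffn_out.get()[0] for _ in range(NUM_MICROBATCHES)]
print("micro-batches completed by the FFN group:", sorted(done))
```

Because data flows only one way between the groups, each side can start on its next micro-batch while the previous one is in flight, which is the overlap the multi-stage pipeline is built around.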

Key facts

  • DisagMoE is a disaggregated MoE training system.
  • It separates attention and FFN layers into disjoint GPU groups.
  • It uses a multi-stage pipeline with uni-directional, many-to-many communications.
  • It jointly optimizes model placement and scheduling.
  • The paper is on arXiv with ID 2605.11005.
  • MoE architectures enable trillion-parameter LLMs.
  • Expert parallelism suffers from all-to-all communication bottlenecks (see the sketch after this list).
  • Prior work left residual network-bound stalls.
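
The all-to-all bottleneck noted above can be seen with back-of-the-envelope arithmetic: in conventional expert parallelism, nearly every routed token crosses the network twice per MoE layer (dispatch to its expert, then combine back), so per-device traffic grows with token count, top-k, and hidden size regardless of how fast the compute is. Every number in this sketch is an assumed, illustrative value, not a figure from the paper.

```python
# Back-of-the-envelope estimate of EP's all-to-all dispatch volume.
# All values below are illustrative assumptions, not figures from the paper.
tokens_per_step = 8192        # tokens processed per device per step
hidden_size = 4096            # model hidden dimension
bytes_per_element = 2         # bf16 activations
top_k = 2                     # experts selected per token
num_devices = 64              # size of the expert-parallel group

# With experts spread uniformly over the group, almost every routed token
# lands on an expert hosted by a different device.
remote_fraction = 1.0 - 1.0 / num_devices

# Each remote token crosses the network twice per MoE layer: dispatch + combine.
bytes_per_layer = (tokens_per_step * top_k * remote_fraction
                   * hidden_size * bytes_per_element * 2)

print(f"all-to-all traffic per device per MoE layer: {bytes_per_layer / 1e9:.2f} GB")
```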

Entities

Institutions

  • arXiv

Sources