DisagMoE: Overlapping Computation and Communication in MoE Training
DisagMoE is a system for training Mixture-of-Experts (MoE) models, an architecture used to scale large language models (LLMs) to trillions of parameters. MoE models rely on sparsely activated experts, and expert parallelism (EP) is a common strategy for training them. However, EP suffers from all-to-all communication bottlenecks, especially as models grow and experts must be spread across GPU nodes with limited inter-node bandwidth. Prior work attempted to overlap these communications with feed-forward network (FFN) and self-attention computations, but network-bound stalls remain because the computation-to-communication ratio is imbalanced. DisagMoE addresses this by disaggregating attention and FFN layers onto disjoint GPU groups and connecting them with a multi-stage pipeline that uses uni-directional, many-to-many communications. The system jointly optimizes model placement and scheduling so that communication overlaps with computation. The paper is available on arXiv under ID 2605.11005.
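The following is a minimal conceptual sketch of the disaggregated design, not the DisagMoE implementation: attention workers and expert (FFN) workers sit in disjoint groups, and token micro-batches flow through a one-way, many-to-many dispatch between them. All names (MicroBatch, AttentionWorker, ExpertWorker) and the queue-based wiring are hypothetical stand-ins for GPU groups and network communication.

```python
# Conceptual sketch only: queues stand in for inter-group (network) links,
# and the routing decision is a random stand-in for the MoE gating function.
from dataclasses import dataclass
from queue import Queue
import random

@dataclass
class MicroBatch:
    layer: int
    tokens: list  # token ids as a stand-in for activations

class AttentionWorker:
    """Runs self-attention for a layer, then dispatches tokens to expert groups."""
    def __init__(self, expert_queues):
        self.expert_queues = expert_queues  # one outgoing link per expert group

    def run(self, batch: MicroBatch):
        # (attention compute would happen here)
        # Uni-directional many-to-many dispatch: each token goes to the expert
        # group chosen by the gate (random placeholder below).
        per_expert = {i: [] for i in range(len(self.expert_queues))}
        for tok in batch.tokens:
            per_expert[random.randrange(len(self.expert_queues))].append(tok)
        for eid, toks in per_expert.items():
            if toks:
                self.expert_queues[eid].put(MicroBatch(batch.layer, toks))

class ExpertWorker:
    """Runs expert FFNs on routed tokens, then forwards them to the next stage."""
    def __init__(self, inbox: Queue, next_stage: Queue):
        self.inbox, self.next_stage = inbox, next_stage

    def step(self):
        batch = self.inbox.get()
        # (expert FFN compute would happen here)
        self.next_stage.put(MicroBatch(batch.layer + 1, batch.tokens))

# Wiring: one attention group feeds four expert groups, which feed the next stage.
next_stage = Queue()
expert_queues = [Queue() for _ in range(4)]
experts = [ExpertWorker(q, next_stage) for q in expert_queues]
attn = AttentionWorker(expert_queues)
attn.run(MicroBatch(layer=0, tokens=list(range(16))))
for e in experts:
    while not e.inbox.empty():
        e.step()
print(next_stage.qsize(), "micro-batches forwarded to the next layer's stage")
```

Because the dispatch only ever flows forward to the next stage, each group can start on the following micro-batch as soon as it has sent the current one, which is what lets communication overlap with computation.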
Key facts
- DisagMoE is a disaggregated MoE training system.
- It separates attention and FFN layers into disjoint GPU groups.
- It uses a multi-stage pipeline with uni-directional, many-to-many communications (see the timeline sketch after this list).
- It jointly optimizes model placement and scheduling.
- The paper is on arXiv with ID 2605.11005.
- MoE architectures enable trillion-parameter LLMs.
- Expert parallelism suffers from all-to-all communication bottlenecks.
- Prior overlapping approaches still leave network-bound stalls.
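To illustrate why pipelining across the disjoint groups helps, here is a toy timeline calculation. The stage times are purely hypothetical (not measurements from the paper); the point is only that once the pipeline fills, throughput is set by the slowest stage, so the dispatch communication is largely hidden behind compute on the attention and FFN groups.

```python
# Toy pipeline makespan estimate with hypothetical per-micro-batch stage times.
attn_ms, dispatch_ms, ffn_ms = 1.0, 0.8, 1.2   # assumed stage times (ms)
micro_batches = 8

# Sequential execution: every micro-batch waits for its own dispatch.
sequential = micro_batches * (attn_ms + dispatch_ms + ffn_ms)

# Pipelined execution: after the pipeline fills, each new micro-batch costs
# only the bottleneck stage, so communication overlaps with computation.
bottleneck = max(attn_ms, dispatch_ms, ffn_ms)
pipelined = (attn_ms + dispatch_ms + ffn_ms) + (micro_batches - 1) * bottleneck

print(f"sequential: {sequential:.1f} ms, pipelined: {pipelined:.1f} ms")
```

Under these assumed numbers the pipelined schedule takes 11.4 ms versus 24.0 ms sequentially; the actual benefit depends on how well placement and scheduling balance the per-stage times, which is the optimization problem DisagMoE targets.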
Entities
Institutions
- arXiv