DisagMoE: Overlapping Computation and Communication in MoE Training
DisagMoE is a system for training Mixture-of-Experts (MoE) models, an architecture used to scale large language models (LLMs) to trillions of parameters. MoE models rely on sparsely activated experts, and expert parallelism (EP) is a common strategy for training them. However, EP suffers from all-to-all communication bottlenecks, especially as models grow and experts must be spread across GPU nodes with limited inter-node bandwidth. Prior work attempted to overlap these communications with feed-forward network (FFN) and self-attention computations, but network-bound stalls remain because the computation-to-communication ratio is imbalanced. DisagMoE addresses this by disaggregating attention and FFN layers onto disjoint GPU groups and connecting them with a multi-stage pipeline that uses uni-directional, many-to-many communications. The system jointly optimizes model placement and scheduling so that communication overlaps with computation. The paper is available on arXiv under ID 2605.11005.
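The following is a minimal conceptual sketch of the disaggregated design, not the DisagMoE implementation: attention workers and expert (FFN) workers sit in disjoint groups, and token micro-batches flow through a one-way, many-to-many dispatch between them. All names (MicroBatch, AttentionWorker, ExpertWorker) and the queue-based wiring are hypothetical stand-ins for GPU groups and network communication.

```python
# Conceptual sketch only: queues stand in for inter-group (network) links,
# and the routing decision is a random stand-in for the MoE gating function.
from dataclasses import dataclass
from queue import Queue
import random

@dataclass
class MicroBatch:
    layer: int
    tokens: list  # token ids as a stand-in for activations

class AttentionWorker:
    """Runs self-attention for a layer, then dispatches tokens to expert groups."""
    def __init__(self, expert_queues):
        self.expert_queues = expert_queues  # one outgoing link per expert group

    def run(self, batch: MicroBatch):
        # (attention compute would happen here)
        # Uni-directional many-to-many dispatch: each token goes to the expert
        # group chosen by the gate (random placeholder below).
        per_expert = {i: [] for i in range(len(self.expert_queues))}
        for tok in batch.tokens:
            per_expert[random.randrange(len(self.expert_queues))].append(tok)
        for eid, toks in per_expert.items():
            if toks:
                self.expert_queues[eid].put(MicroBatch(batch.layer, toks))

class ExpertWorker:
    """Runs expert FFNs on routed tokens, then forwards them to the next stage."""
    def __init__(self, inbox: Queue, next_stage: Queue):
        self.inbox, self.next_stage = inbox, next_stage

    def step(self):
        batch = self.inbox.get()
        # (expert FFN compute would happen here)
        self.next_stage.put(MicroBatch(batch.layer + 1, batch.tokens))

# Wiring: one attention group feeds four expert groups, which feed the next stage.
next_stage = Queue()
expert_queues = [Queue() for _ in range(4)]
experts = [ExpertWorker(q, next_stage) for q in expert_queues]
attn = AttentionWorker(expert_queues)
attn.run(MicroBatch(layer=0, tokens=list(range(16))))
for e in experts:
    while not e.inbox.empty():
        e.step()
print(next_stage.qsize(), "micro-batches forwarded to the next layer's stage")
```

Because the dispatch only ever flows forward to the next stage, each group can start on the following micro-batch as soon as it has sent the current one, which is what lets communication overlap with computation.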
Key facts
- DisagMoE is a disaggregated MoE training system.
- It separates attention and FFN layers into disjoint GPU groups.
- It uses a multi-stage pipeline with uni-directional, many-to-many communications (see the timeline sketch after this list).
- It jointly optimizes model placement and scheduling.
- The paper is on arXiv with ID 2605.11005.
- MoE architectures enable trillion-parameter LLMs.
- Expert parallelism suffers from all-to-all communication bottlenecks.
- Prior overlapping approaches still leave network-bound stalls.
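To illustrate why pipelining across the disjoint groups helps, here is a toy timeline calculation. The stage times are purely hypothetical (not measurements from the paper); the point is only that once the pipeline fills, throughput is set by the slowest stage, so the dispatch communication is largely hidden behind compute on the attention and FFN groups.

```python
# Toy pipeline makespan estimate with hypothetical per-micro-batch stage times.
attn_ms, dispatch_ms, ffn_ms = 1.0, 0.8, 1.2   # assumed stage times (ms)
micro_batches = 8

# Sequential execution: every micro-batch waits for its own dispatch.
sequential = micro_batches * (attn_ms + dispatch_ms + ffn_ms)

# Pipelined execution: after the pipeline fills, each new micro-batch costs
# only the bottleneck stage, so communication overlaps with computation.
bottleneck = max(attn_ms, dispatch_ms, ffn_ms)
pipelined = (attn_ms + dispatch_ms + ffn_ms) + (micro_batches - 1) * bottleneck

print(f"sequential: {sequential:.1f} ms, pipelined: {pipelined:.1f} ms")
```

Under these assumed numbers the pipelined schedule takes 11.4 ms versus 24.0 ms sequentially; the actual benefit depends on how well placement and scheduling balance the per-stage times, which is the optimization problem DisagMoE targets.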
Entities
Institutions
- arXiv