Expert Upcycling: Efficiently Scaling Mixture-of-Experts Models
A new method called expert upcycling enables progressive expansion of Mixture-of-Experts (MoE) large language models by increasing the number of experts during continued pre-training. The technique duplicates existing experts and extends the router while keeping top-K routing fixed, so per-token inference cost is preserved. Because the expanded model starts from a warm initialization of an already-trained MoE, it avoids much of the expense of training a larger MoE from scratch, where memory and communication overhead scale with total parameter count. The approach aims to shift the compute-efficient frontier by scaling total parameters without increasing active computation. The paper is published on arXiv under identifier 2604.19835.
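Concretely, the expansion can be pictured as copying expert weights (with a little noise so the copies can diverge during continued pre-training) and widening the router's output dimension, leaving top-K untouched. The sketch below is a minimal PyTorch illustration under those assumptions; the layer structure, the names `MoELayer` and `upcycle`, and the noise-based symmetry breaking are illustrative choices, not details confirmed by the paper.

```python
import copy
import torch
import torch.nn as nn


class MoELayer(nn.Module):
    """Toy top-K MoE layer: a linear router over nn.Linear 'experts'."""

    def __init__(self, d_model: int, num_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.router(x)                         # (..., num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)  # K experts per token
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out


def upcycle(layer: MoELayer, new_num_experts: int, noise_std: float = 1e-3) -> None:
    """Grow the expert count in place: duplicate existing experts (plus tiny
    noise so the copies can diverge) and widen the router. top_k stays fixed,
    so per-token expert compute is unchanged."""
    old_n = len(layer.experts)
    assert new_num_experts > old_n
    for i in range(old_n, new_num_experts):
        clone = copy.deepcopy(layer.experts[i % old_n])
        with torch.no_grad():
            for p in clone.parameters():
                p.add_(noise_std * torch.randn_like(p))
        layer.experts.append(clone)

    # Extend the router's output dimension, reusing the source experts' rows
    # (again with noise, so a copy and its source do not tie in top-K).
    old_w = layer.router.weight.data                    # (old_n, d_model)
    new_router = nn.Linear(old_w.shape[1], new_num_experts, bias=False)
    new_router = new_router.to(old_w.device, old_w.dtype)
    with torch.no_grad():
        new_router.weight[:old_n] = old_w
        for i in range(old_n, new_num_experts):
            new_router.weight[i] = old_w[i % old_n] + noise_std * torch.randn_like(
                old_w[i % old_n]
            )
    layer.router = new_router
```

The small Gaussian noise is one common way to break symmetry between a source expert and its copy; with exact duplication, the two receive identical routing and identical gradients and never diverge.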
Key facts
- Expert upcycling expands MoE capacity by increasing expert count during continued pre-training.
- The method duplicates experts and extends the router while keeping top-K routing fixed.
- It preserves per-token inference cost and provides a warm initialization from the trained model (a quick check follows this list).
- Training large MoEs from scratch is expensive because memory and communication costs scale with total parameter count.
- The technique aims to shift the compute-efficient frontier for MoE models.
- Paper available on arXiv: 2604.19835.
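For intuition on the fixed per-token cost, here is a small check reusing the hypothetical `MoELayer` and `upcycle` sketch above: doubling the expert count doubles total expert parameters, while the number of experts active per token (and hence per-token expert FLOPs) stays at top_k.

```python
# Hypothetical sanity check built on the sketch above: active expert
# parameters per token depend on top_k, not on the total expert count.
layer = MoELayer(d_model=512, num_experts=8, top_k=2)
per_expert = sum(p.numel() for p in layer.experts[0].parameters())
active_before = layer.top_k * per_expert

upcycle(layer, new_num_experts=16)
active_after = layer.top_k * per_expert
total = sum(p.numel() for e in layer.experts for p in e.parameters())

assert active_before == active_after     # per-token expert compute unchanged
print(f"total expert params: {total}")   # total capacity doubled
# The router itself grows from (d_model x 8) to (d_model x 16) logits,
# a negligible cost next to the expert FFNs.
```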
Entities
Institutions
- arXiv