Expert Upcycling: Efficiently Scaling Mixture-of-Experts Models
A new method called expert upcycling enables progressive expansion of Mixture-of-Experts (MoE) large language models by increasing the number of experts during continued pre-training. The technique duplicates existing experts and extends the router while keeping top-K routing fixed, so per-token inference cost is preserved. Because the expanded model starts from a warm initialization of an already-trained MoE, it avoids much of the expense of training a larger MoE from scratch, where memory and communication overhead scale with total parameter count. The approach aims to shift the compute-efficient frontier by scaling total parameters without increasing active computation. The paper is published on arXiv under identifier 2604.19835.
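Concretely, the expansion can be pictured as copying expert weights (with a little noise so the copies can diverge during continued pre-training) and widening the router's output dimension, leaving top-K untouched. The sketch below is a minimal PyTorch illustration under those assumptions; the layer structure, the names `MoELayer` and `upcycle`, and the noise-based symmetry breaking are illustrative choices, not details confirmed by the paper.

```python
import copy
import torch
import torch.nn as nn


class MoELayer(nn.Module):
    """Toy top-K MoE layer: a linear router over nn.Linear 'experts'."""

    def __init__(self, d_model: int, num_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.router(x)                         # (..., num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)  # K experts per token
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out


def upcycle(layer: MoELayer, new_num_experts: int, noise_std: float = 1e-3) -> None:
    """Grow the expert count in place: duplicate existing experts (plus tiny
    noise so the copies can diverge) and widen the router. top_k stays fixed,
    so per-token expert compute is unchanged."""
    old_n = len(layer.experts)
    assert new_num_experts > old_n
    for i in range(old_n, new_num_experts):
        clone = copy.deepcopy(layer.experts[i % old_n])
        with torch.no_grad():
            for p in clone.parameters():
                p.add_(noise_std * torch.randn_like(p))
        layer.experts.append(clone)

    # Extend the router's output dimension, reusing the source experts' rows
    # (again with noise, so a copy and its source do not tie in top-K).
    old_w = layer.router.weight.data                    # (old_n, d_model)
    new_router = nn.Linear(old_w.shape[1], new_num_experts, bias=False)
    new_router = new_router.to(old_w.device, old_w.dtype)
    with torch.no_grad():
        new_router.weight[:old_n] = old_w
        for i in range(old_n, new_num_experts):
            new_router.weight[i] = old_w[i % old_n] + noise_std * torch.randn_like(
                old_w[i % old_n]
            )
    layer.router = new_router
```

The small Gaussian noise is one common way to break symmetry between a source expert and its copy; with exact duplication, the two receive identical routing and identical gradients and never diverge.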
Key facts
- Expert upcycling expands MoE capacity by increasing expert count during continued pre-training.
- The method duplicates experts and extends the router while keeping top-K routing fixed.
- It preserves per-token inference cost and provides a warm initialization from the trained model (a quick check follows this list).
- Training large MoEs from scratch is expensive because memory and communication costs scale with total parameter count.
- The technique aims to shift the compute-efficient frontier for MoE models.
- Paper available on arXiv: 2604.19835.
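For intuition on the fixed per-token cost, here is a small check reusing the hypothetical `MoELayer` and `upcycle` sketch above: doubling the expert count doubles total expert parameters, while the number of experts active per token (and hence per-token expert FLOPs) stays at top_k.

```python
# Hypothetical sanity check built on the sketch above: active expert
# parameters per token depend on top_k, not on the total expert count.
layer = MoELayer(d_model=512, num_experts=8, top_k=2)
per_expert = sum(p.numel() for p in layer.experts[0].parameters())
active_before = layer.top_k * per_expert

upcycle(layer, new_num_experts=16)
active_after = layer.top_k * per_expert
total = sum(p.numel() for e in layer.experts for p in e.parameters())

assert active_before == active_after     # per-token expert compute unchanged
print(f"total expert params: {total}")   # total capacity doubled
# The router itself grows from (d_model x 8) to (d_model x 16) logits,
# a negligible cost next to the expert FFNs.
```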
Entities
Institutions
- arXiv