ARTFEED — Contemporary Art Intelligence

Expert Upcycling: Efficiently Scaling Mixture-of-Experts Models

ai-technology · 2026-04-24

A new method called expert upcycling enables progressive expansion of Mixture-of-Experts (MoE) large language models by increasing the number of experts during continued pre-training. The technique duplicates existing experts and extends the router to cover them, while keeping top-K routing fixed so per-token inference cost is unchanged. The duplicated weights give the enlarged model a warm initialization from an already-trained one, and because much of pre-training is spent at a smaller expert count, the memory and communication overheads that scale with total parameter count are lower than when training the larger MoE from scratch. The approach aims to shift the compute-efficient frontier by scaling total parameters without increasing active computation. The paper is published on arXiv under identifier 2604.19835.
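The core operation described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: expert weights are plain numpy arrays, each expert is copied an equal number of times, and a small noise term (an assumption, a common symmetry-breaking trick) lets duplicated experts diverge during continued training. The function names `upcycle` and `route_top_k` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def upcycle(experts, router_w, new_num_experts, noise=1e-3):
    """Duplicate trained experts and widen the router to match.

    experts: list of per-expert weight arrays (stand-ins for expert FFNs).
    router_w: (d_model, num_experts) router projection.
    """
    old_n = len(experts)
    # Assumption for this sketch: each old expert is copied equally often.
    assert new_num_experts % old_n == 0
    copies = new_num_experts // old_n
    # Duplicate experts; tiny noise breaks symmetry so copies can specialize.
    new_experts = [w + noise * rng.standard_normal(w.shape)
                   for w in experts for _ in range(copies)]
    # Tile router columns so each copy starts with its parent's routing score.
    new_router = (np.repeat(router_w, copies, axis=1)
                  + noise * rng.standard_normal((router_w.shape[0],
                                                 new_num_experts)))
    return new_experts, new_router

def route_top_k(x, router_w, k=2):
    """Top-K routing: K stays fixed, so per-token compute is unchanged."""
    logits = x @ router_w
    return np.argsort(logits)[-k:]  # indices of the k highest-scoring experts
```

Because `route_top_k` still selects the same number of experts after expansion, only total capacity grows; the per-token forward cost does not.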

Key facts

  • Expert upcycling expands MoE capacity by increasing expert count during continued pre-training.
  • The method duplicates experts and extends the router while keeping top-K routing fixed.
  • It preserves per-token inference cost and provides warm initialization from a trained model.
  • Training large MoEs is expensive due to memory and communication scaling with total parameters.
  • The technique aims to shift the compute-efficient frontier for MoE models.
  • Paper available on arXiv: 2604.19835.
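The fixed-active-compute point in the facts above can be made concrete with back-of-the-envelope arithmetic. The dimensions below are hypothetical, chosen only to illustrate the claim: doubling the expert count doubles total expert parameters, but with top-K fixed, the active (per-token) parameter count does not move.

```python
# Hypothetical MoE layer dimensions (illustration only, not from the paper).
d_model, d_ff, top_k = 1024, 4096, 2
params_per_expert = 2 * d_model * d_ff  # up- and down-projection weights

def moe_params(num_experts):
    """Return (total, active-per-token) expert parameter counts."""
    total = num_experts * params_per_expert
    active = top_k * params_per_expert  # only top_k experts fire per token
    return total, active

before = moe_params(8)    # before upcycling
after = moe_params(16)    # after doubling the expert count
assert after[0] == 2 * before[0]  # total parameters double
assert after[1] == before[1]      # active per-token parameters unchanged
```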

Entities

Institutions

  • arXiv
