EMO: Pretraining Mixture of Experts for Emergent Modularity
The Allen Institute for AI has released EMO, a mixture-of-experts (MoE) language model pretrained with an objective that encourages modular structure to emerge from data. Unlike standard MoEs, whose experts tend to specialize in low-level lexical patterns, EMO's experts organize into semantically coherent groups corresponding to domains such as health, news, or politics. As a result, a task can selectively use only 12.5% of the experts (16 of 128) while staying within about 3% absolute of full-model performance. The model has 1 billion active and 14 billion total parameters and was trained on 1 trillion tokens. EMO achieves modularity by restricting all tokens in a document to route within a shared expert pool, using document boundaries as a weak supervisory signal; global load balancing prevents routing collapse. With all experts active, EMO matches standard MoE performance. The release includes the EMO-trained model, a standard MoE baseline, and training code.
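The document-level routing constraint can be sketched as masking the router's logits so every token in a document picks its top-k experts from the same shared pool. This is a minimal illustrative sketch, not EMO's released code; the function name, the fixed `doc_pool` input, and how that pool is chosen are assumptions made for the example.

```python
import numpy as np

def route_with_document_pool(logits, doc_pool, k):
    """Top-k expert routing restricted to a per-document expert pool.

    logits:   (num_tokens, num_experts) router scores for one document.
    doc_pool: indices of the experts this document is allowed to use
              (hypothetical: how EMO selects the pool is not shown here).
    Returns a (num_tokens, k) array of selected expert indices.
    """
    # Mask out experts outside the document's shared pool.
    masked = np.full_like(logits, -np.inf)
    masked[:, doc_pool] = logits[:, doc_pool]
    # Pick the top-k experts per token within the allowed pool.
    return np.argpartition(-masked, k - 1, axis=1)[:, :k]

# Toy example: 4 tokens, 16 experts, a pool of 6 experts, 2 active per token.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 16))
pool = np.array([1, 3, 5, 7, 9, 11])
chosen = route_with_document_pool(logits, pool, k=2)
# Every selected expert lies inside the document's shared pool.
assert np.isin(chosen, pool).all()
```

Because the mask is shared across the whole document, tokens can only compete over a common expert subset, which is the weak supervisory signal that pushes experts toward document-level (i.e., semantic) rather than token-level specialization.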
Key facts
- EMO is a 1B-active, 14B-total-parameter MoE with 128 experts, 8 active per token.
- Trained on 1 trillion tokens.
- Selective use of 12.5% of experts (16 of 128) retains near full-model performance (~3% absolute drop).
- Experts specialize in semantic domains like Health, News, Politics, not lexical patterns.
- Document-level routing constraint enforces consistent expert usage within documents.
- Global load balancing used to prevent collapse.
- Matches standard MoE performance when all experts are used.
- Released by the Allen Institute for AI on Hugging Face.
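The load-balancing item above can be illustrated with a standard Switch-Transformer-style auxiliary loss, which penalizes routers that concentrate tokens on a few experts. This is an assumed stand-in for whatever global balancing objective EMO actually uses, not the released training code.

```python
import numpy as np

def load_balance_loss(router_probs, expert_assign, num_experts):
    """Switch-Transformer-style auxiliary load-balancing loss (illustrative).

    router_probs:  (num_tokens, num_experts) softmax router probabilities.
    expert_assign: (num_tokens,) chosen expert index per token (top-1 here).
    EMO applies balancing globally; this per-batch version is only a
    standard illustrative instance, not EMO's exact objective.
    """
    # f_i: fraction of tokens dispatched to expert i.
    frac_tokens = np.bincount(expert_assign, minlength=num_experts) / len(expert_assign)
    # P_i: mean router probability mass assigned to expert i.
    mean_prob = router_probs.mean(axis=0)
    # Minimized (value 1.0) when both load and probability are uniform.
    return num_experts * np.dot(frac_tokens, mean_prob)

# Perfectly balanced case: uniform probabilities, one token per expert.
probs = np.full((4, 4), 0.25)
assign = np.array([0, 1, 2, 3])
loss = load_balance_loss(probs, assign, num_experts=4)
assert abs(loss - 1.0) < 1e-9
```

Such a loss keeps all 128 experts in use during pretraining, which is what prevents the routing collapse mentioned above.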
Entities
Institutions
- Allen Institute for AI
- Hugging Face