EMO: Pretraining Mixture of Experts for Emergent Modularity
The Allen Institute for AI has released EMO, a mixture-of-experts (MoE) language model pretrained with an objective that encourages modular structure to emerge from data. Unlike standard MoEs, whose experts tend to specialize in low-level lexical patterns, EMO's experts organize into semantically coherent groups corresponding to domains such as health, news, or politics. As a result, a task can selectively use only 12.5% of the experts (16 of 128) while staying within about 3% absolute of full-model performance. The model has 1 billion active and 14 billion total parameters and was trained on 1 trillion tokens. EMO achieves modularity by restricting all tokens in a document to route within a shared expert pool, using document boundaries as a weak supervisory signal; global load balancing prevents routing collapse. With all experts active, EMO matches standard MoE performance. The release includes the EMO-trained model, a standard MoE baseline, and training code.
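The document-level routing constraint can be sketched as masking the router's logits so every token in a document picks its top-k experts from the same shared pool. This is a minimal illustrative sketch, not EMO's released code; the function name, the fixed `doc_pool` input, and how that pool is chosen are assumptions made for the example.

```python
import numpy as np

def route_with_document_pool(logits, doc_pool, k):
    """Top-k expert routing restricted to a per-document expert pool.

    logits:   (num_tokens, num_experts) router scores for one document.
    doc_pool: indices of the experts this document is allowed to use
              (hypothetical: how EMO selects the pool is not shown here).
    Returns a (num_tokens, k) array of selected expert indices.
    """
    # Mask out experts outside the document's shared pool.
    masked = np.full_like(logits, -np.inf)
    masked[:, doc_pool] = logits[:, doc_pool]
    # Pick the top-k experts per token within the allowed pool.
    return np.argpartition(-masked, k - 1, axis=1)[:, :k]

# Toy example: 4 tokens, 16 experts, a pool of 6 experts, 2 active per token.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 16))
pool = np.array([1, 3, 5, 7, 9, 11])
chosen = route_with_document_pool(logits, pool, k=2)
# Every selected expert lies inside the document's shared pool.
assert np.isin(chosen, pool).all()
```

Because the mask is shared across the whole document, tokens can only compete over a common expert subset, which is the weak supervisory signal that pushes experts toward document-level (i.e., semantic) rather than token-level specialization.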
Key facts
- EMO is a 1B-active, 14B-total-parameter MoE with 128 experts, 8 active per token.
- Trained on 1 trillion tokens.
- Selective use of 12.5% of experts (16 of 128) retains near full-model performance (~3% absolute drop).
- Experts specialize in semantic domains like Health, News, Politics, not lexical patterns.
- Document-level routing constraint enforces consistent expert usage within documents.
- Global load balancing used to prevent collapse.
- Matches standard MoE performance when all experts are used.
- Released by the Allen Institute for AI on Hugging Face.
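The load-balancing item above can be illustrated with a standard Switch-Transformer-style auxiliary loss, which penalizes routers that concentrate tokens on a few experts. This is an assumed stand-in for whatever global balancing objective EMO actually uses, not the released training code.

```python
import numpy as np

def load_balance_loss(router_probs, expert_assign, num_experts):
    """Switch-Transformer-style auxiliary load-balancing loss (illustrative).

    router_probs:  (num_tokens, num_experts) softmax router probabilities.
    expert_assign: (num_tokens,) chosen expert index per token (top-1 here).
    EMO applies balancing globally; this per-batch version is only a
    standard illustrative instance, not EMO's exact objective.
    """
    # f_i: fraction of tokens dispatched to expert i.
    frac_tokens = np.bincount(expert_assign, minlength=num_experts) / len(expert_assign)
    # P_i: mean router probability mass assigned to expert i.
    mean_prob = router_probs.mean(axis=0)
    # Minimized (value 1.0) when both load and probability are uniform.
    return num_experts * np.dot(frac_tokens, mean_prob)

# Perfectly balanced case: uniform probabilities, one token per expert.
probs = np.full((4, 4), 0.25)
assign = np.array([0, 1, 2, 3])
loss = load_balance_loss(probs, assign, num_experts=4)
assert abs(loss - 1.0) < 1e-9
```

Such a loss keeps all 128 experts in use during pretraining, which is what prevents the routing collapse mentioned above.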
Entities
Institutions
- Allen Institute for AI
- Hugging Face