ARTFEED — Contemporary Art Intelligence

EMO: Pretraining Mixture of Experts for Emergent Modularity

ai-technology · 2026-05-08

The Allen Institute for AI has released EMO, a new mixture-of-experts (MoE) language model pretrained with an objective that encourages modular structure to emerge from the data. Unlike standard MoEs, whose experts tend to specialize in low-level lexical patterns, EMO's experts organize into semantically coherent groups corresponding to domains such as health, news, or politics. This allows a given task to use only 12.5% of the experts (16 out of 128) while retaining near full-model performance, with an absolute drop of only about 3%. The model has 1 billion active parameters and 14 billion total parameters, and was trained on 1 trillion tokens. EMO achieves its modularity by constraining all tokens in a document to route within a shared expert pool, using document boundaries as a weak supervisory signal, while global load balancing prevents routing collapse. When all experts are used, EMO matches standard MoE performance. The release includes the EMO-trained model, a standard MoE baseline, and training code.
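
The two mechanisms in that description, a shared per-document expert pool and a global load-balancing penalty, can be pictured as a two-stage router: pool the router logits over a document to pick a small expert pool, then apply ordinary token-level top-k routing restricted to that pool. The PyTorch sketch below is illustrative only; the module name DocRoutedMoE, the scaled-down sizes, and the Switch-style auxiliary loss are assumptions, not the released EMO training code.

    # Minimal, hypothetical sketch of document-consistent MoE routing with a
    # global load-balancing loss. NOT the released EMO code; names, pool size,
    # and the auxiliary-loss form are assumptions for illustration.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DocRoutedMoE(nn.Module):
        def __init__(self, d_model=64, d_ff=128, num_experts=8, top_k=2, doc_pool=4):
            # Sizes are scaled down from the article's 128 experts / 8 active.
            super().__init__()
            self.num_experts, self.top_k, self.doc_pool = num_experts, top_k, doc_pool
            self.router = nn.Linear(d_model, num_experts, bias=False)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(num_experts)
            )

        def forward(self, x):
            # x: (documents, tokens, d_model); each row is one document.
            logits = self.router(x)                                   # (B, T, E)
            # Document-level constraint: choose a shared expert pool per document
            # from router logits averaged over the document's tokens.
            doc_scores = logits.mean(dim=1)                           # (B, E)
            pool_idx = doc_scores.topk(self.doc_pool, dim=-1).indices # (B, P)
            pool_mask = torch.full_like(logits, float("-inf"))
            pool_mask.scatter_(-1, pool_idx.unsqueeze(1).expand(-1, x.size(1), -1), 0.0)
            # Token-level top-k routing, restricted to the document's pool.
            probs = F.softmax(logits + pool_mask, dim=-1)             # (B, T, E)
            top_p, top_i = probs.topk(self.top_k, dim=-1)
            top_p = top_p / top_p.sum(dim=-1, keepdim=True)
            # Combine the selected experts' outputs (dense loop for clarity).
            out = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):
                for k in range(self.top_k):
                    sel = top_i[..., k] == e                          # tokens routed to e
                    if sel.any():
                        out[sel] += top_p[..., k][sel].unsqueeze(-1) * expert(x[sel])
            # Global load-balancing auxiliary loss (Switch-style), computed over
            # the whole batch rather than per document, to keep all experts used.
            token_frac = F.one_hot(top_i, self.num_experts).float().mean(dim=(0, 1, 2))
            prob_frac = probs.mean(dim=(0, 1))
            aux_loss = self.num_experts * torch.sum(token_frac * prob_frac)
            return out, aux_loss

    if __name__ == "__main__":
        moe = DocRoutedMoE()
        docs = torch.randn(2, 16, 64)       # 2 documents of 16 tokens each
        y, aux = moe(docs)
        print(y.shape, float(aux))

Restricting every token in a document to the same small pool is what turns document boundaries into a weak supervisory signal for expert specialization.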

Key facts

  • EMO is a 1B-active, 14B-total-parameter MoE with 128 experts, 8 active per token.
  • Trained on 1 trillion tokens.
  • Selective use of 12.5% of experts (16 of 128) retains near full-model performance with an absolute drop of about 3% (see the selection sketch after this list).
  • Experts specialize in semantic domains such as Health, News, and Politics rather than low-level lexical patterns.
  • Document-level routing constraint enforces consistent expert usage within documents.
  • Global load balancing is used to prevent routing collapse.
  • Matches standard MoE performance when all experts are used.
  • Released by the Allen Institute for AI on Hugging Face.
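
One plausible reading of the 12.5% figure is an activation-count heuristic: route a small in-domain sample through the trained router, count which experts win most often, and keep only those 16 at inference time. The snippet below, including the function name select_expert_subset, is a hypothetical illustration of such a selection step, not the procedure used in the release.

    # Hypothetical sketch: selecting a domain-specific subset of experts after
    # pretraining. The function name and the activation-counting heuristic are
    # assumptions for illustration.
    import torch

    def select_expert_subset(router_logits, keep=16):
        """router_logits: (num_tokens, num_experts) from a small in-domain sample.
        Returns indices of the `keep` most frequently activated experts."""
        top1 = router_logits.argmax(dim=-1)                   # winning expert per token
        counts = torch.bincount(top1, minlength=router_logits.size(-1))
        return counts.topk(keep).indices                      # e.g. 16 of 128 experts

    # Usage: run a few in-domain documents through the router, keep the 12.5%
    # of experts they activate most, and mask out the rest at inference time.
    logits = torch.randn(4096, 128)                           # stand-in router logits
    kept = select_expert_subset(logits, keep=16)
    print(sorted(kept.tolist()))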

Entities

Institutions

  • Allen Institute for AI
  • Hugging Face

Sources