Research Reveals How Mixture-of-Experts Models Route Information Through Control and Content Channels
A recent study introduces a parameter-free decomposition technique for Mixture-of-Experts (MoE) models and applies it to six distinct architectures. The findings show that each layer's hidden state splits into two separate channels: a control signal that influences routing choices and an orthogonal content channel that the router cannot detect. The content channel retains surface-level attributes such as language, token identity, and position, whereas the control signal encodes an abstract function that varies across layers. Because routing decisions carry little bandwidth, this division forces compositional specialization across layers. Although individual experts remain polysemantic, the paths tokens take through experts become monosemantic, grouping tokens by semantic function across languages and surface forms. The same token may take different paths depending on its semantic context. The research was published on arXiv under identifier 2604.17837v1.
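The core idea of an orthogonal control/content split can be illustrated with a small sketch. This is our own assumption-laden toy, not the paper's method: we assume a linear top-k router (logits = W_r @ h) and project the hidden state onto the router's row space (the part the router can "see") and its orthogonal complement (the part it cannot). The matrix `W_r` and dimensions are invented for illustration.

```python
import numpy as np

# Toy illustration (not the paper's decomposition): split a hidden state
# into the subspace a linear router can observe and its orthogonal
# complement, which the router provably cannot detect.
rng = np.random.default_rng(0)
d_model, n_experts = 64, 8

W_r = rng.standard_normal((n_experts, d_model))  # hypothetical router weights
h = rng.standard_normal(d_model)                 # one token's hidden state

# Orthonormal basis for the router's row space via SVD.
_, _, Vt = np.linalg.svd(W_r, full_matrices=False)
P_control = Vt.T @ Vt            # projector onto the router-visible subspace

h_control = P_control @ h        # "control" part: influences routing
h_content = h - h_control        # "content" part: invisible to the router

# The router's logits depend only on the control component...
assert np.allclose(W_r @ h, W_r @ h_control)
# ...and the two channels are orthogonal.
assert abs(float(h_control @ h_content)) < 1e-8
```

Under this linear-router assumption, any information stored purely in `h_content` (language, token identity, position, per the study) passes through the layer without affecting expert selection.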
Key facts
- A parameter-free decomposition method for Mixture-of-Experts models was introduced
- The method splits each layer's hidden state into control and content channels
- Six different MoE architectures were analyzed in the research
- Surface-level features are preserved in the content channel
- The control signal encodes an abstract function that rotates between layers
- Routing decisions operate with low bandwidth
- Individual experts remain polysemantic while expert paths become monosemantic
- The research was published on arXiv with identifier 2604.17837v1
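The "low bandwidth" point above can be made concrete with back-of-envelope arithmetic. The numbers below (64 experts, top-8 routing, a 4096-dimensional float16 hidden state) are our own illustrative assumptions, not figures from the paper: a top-k router choosing k of N experts can convey at most log2(C(N, k)) bits per layer, orders of magnitude less than the hidden state nominally carries.

```python
from math import comb, log2

# Illustrative configuration (assumed, not from the paper).
n_experts, top_k = 64, 8
d_model = 4096

# Maximum information a top-k routing decision can convey per layer.
routing_bits = log2(comb(n_experts, top_k))   # log2 of the number of expert subsets

# Nominal capacity of a float16 hidden state, for scale.
hidden_bits = d_model * 16

print(f"routing decision: ~{routing_bits:.1f} bits per layer")
print(f"hidden state:     {hidden_bits} bits nominal")
```

With these assumed numbers the routing decision carries roughly 32 bits per layer, which suggests why, as the study argues, specialization must be composed across layers rather than expressed within any single routing choice.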
Entities
Institutions
- arXiv