DTop-p MoE: Dynamic Top-p Routing for Foundation Model Pre-training
A new paper on arXiv proposes DTop-p, a dynamic routing mechanism for sparse Mixture-of-Experts (MoE) architectures. Standard Top-k routing activates a fixed number of experts per token, regardless of token difficulty or layer-specific needs. Top-p routing instead selects experts adaptively, adding them until their cumulative routing probability reaches a threshold, but naive implementations with a fixed global threshold offer only marginal gains, are sensitive to hyperparameters, and leave compute costs uncontrolled. DTop-p uses a Proportional-Integral (PI) controller to learn the Top-p probability threshold per layer, enabling sparsity control and dynamic routing normalization. The method aims to improve efficiency and performance in foundation model pre-training.
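To make the contrast between the two routing rules concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the function names, the example logits, and the threshold value 0.7 are all assumptions chosen for illustration. It shows only the generic idea that Top-k always returns k experts, while Top-p returns as many experts as are needed to reach a cumulative probability.

```python
import math

def softmax(logits):
    """Convert router logits into a probability distribution over experts."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_experts(probs, k):
    """Top-k routing: always pick the k most probable experts."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]

def top_p_experts(probs, p):
    """Top-p routing: add experts in descending probability order
    until their cumulative probability reaches the threshold p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return chosen

# Hypothetical router outputs for two tokens over four experts.
confident = softmax([6.0, 1.0, 0.5, 0.0])  # one expert clearly dominates
uncertain = softmax([1.0, 0.9, 0.8, 0.7])  # probability mass is spread out

for name, probs in [("confident", confident), ("uncertain", uncertain)]:
    print(name,
          "top-k:", len(top_k_experts(probs, 2)),
          "top-p:", len(top_p_experts(probs, 0.7)))
```

In this toy setup the token with a peaked distribution activates fewer experts under Top-p than the token with a flat distribution, while Top-k activates two experts for both. It also shows why a fixed threshold leaves cost uncontrolled: the number of active experts depends entirely on how peaked the router's distributions happen to be.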
Key facts
- DTop-p is a sparsity-controllable dynamic routing mechanism for MoE.
- It uses a Proportional-Integral controller to learn the Top-p probability threshold.
- Dynamic routing normalization supports layer-wise expert selection.
- Standard Top-k routing imposes rigid sparsity, ignoring token difficulty.
- Naive Top-p with fixed global thresholds provides marginal gains over Top-k.
- The paper is on arXiv with ID 2512.13996.
- The method targets foundation model pre-training.
- DTop-p addresses the hyperparameter sensitivity and uncontrolled costs of naive Top-p routing.
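The role of the Proportional-Integral controller can be illustrated with a small closed-loop sketch. This is a generic textbook PI loop under stated assumptions, not the controller described in the paper: the class name PIThresholdController, the gains kp and ki, the initial threshold, the clamping bounds, and the toy batch are all hypothetical. The controlled quantity here is the average number of active experts per token, and the control signal is the Top-p threshold.

```python
def count_top_p(probs, p):
    """Number of experts Top-p routing would activate for one token."""
    cumulative = 0.0
    for n, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return n
    return len(probs)

class PIThresholdController:
    """Generic PI loop that nudges a Top-p threshold toward a sparsity target.

    All parameter values are illustrative assumptions, not tuned settings.
    """

    def __init__(self, target, kp=0.02, ki=0.01, p_init=0.5):
        self.target = target    # desired average active experts per token
        self.kp = kp            # proportional gain
        self.ki = ki            # integral gain
        self.p_init = p_init    # starting threshold
        self.integral = 0.0     # accumulated error
        self.p = p_init         # current threshold

    def update(self, measured):
        # Positive error means too few experts are active, so raise p.
        error = self.target - measured
        self.integral += error
        raw = self.p_init + self.kp * error + self.ki * self.integral
        self.p = min(max(raw, 0.01), 0.99)  # keep the threshold in (0, 1)
        return self.p

def run_closed_loop(batch, controller, steps):
    """Route a fixed batch repeatedly, updating the threshold each step."""
    history = []
    for _ in range(steps):
        counts = [count_top_p(probs, controller.p) for probs in batch]
        measured = sum(counts) / len(counts)
        history.append((controller.p, measured))
        controller.update(measured)
    return history

# Hypothetical routing distributions for three tokens over four experts.
batch = [
    [0.70, 0.20, 0.05, 0.05],
    [0.40, 0.30, 0.20, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]
history = run_closed_loop(batch, PIThresholdController(target=2.0), steps=50)
print("final threshold and measured sparsity:", history[-1])
```

The design point is that the threshold stops being a hand-tuned hyperparameter: the user specifies a sparsity target, and the feedback loop adjusts the threshold to track it. Because the expert count is a discrete function of the threshold, a toy loop like this one may hover around the target rather than settle on it exactly.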
Entities
Institutions
- arXiv