DTop-p MoE: Dynamic Top-p Routing for Foundation Model Pre-training

ai-technology · 2026-06-01

A new paper on arXiv proposes DTop-p, a dynamic routing mechanism for sparse Mixture-of-Experts (MoE) architectures. Standard Top-k routing activates a fixed number of experts for every token, regardless of token difficulty or layer-specific needs. Top-p routing instead selects experts adaptively, adding them in order of router probability until their cumulative probability reaches a threshold. Existing naive implementations, however, use a fixed global threshold; according to the paper, they offer only marginal gains, are sensitive to the threshold hyperparameter, and leave compute cost uncontrolled. DTop-p uses a Proportional-Integral (PI) controller to learn the Top-p probability threshold, which makes sparsity controllable, and adds dynamic routing normalization to support layer-wise expert selection. The method aims to improve efficiency and performance in foundation model pre-training.
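The difference between the two selection rules can be illustrated with a toy example for a single token. This is a minimal sketch of generic Top-k and Top-p selection, not the paper's implementation; the function names, the logits, and the threshold of 0.7 are assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Convert router logits into a probability distribution over experts."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_experts(probs, k):
    """Fixed budget: always return the k most probable experts."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]


def top_p_experts(probs, p):
    """Adaptive budget: add experts in descending probability order
    until their cumulative probability reaches the threshold p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return chosen


# Hypothetical router outputs: one token the router is sure about,
# one where probability mass is spread across experts.
confident = softmax([6.0, 1.0, 0.5, 0.0])
uncertain = softmax([1.0, 0.9, 0.8, 0.7])

for name, probs in [("confident", confident), ("uncertain", uncertain)]:
    print(name, "top-k:", top_k_experts(probs, 2), "top-p:", top_p_experts(probs, 0.7))
```

With these inputs, Top-k assigns two experts to both tokens, while Top-p assigns one expert to the confident token and three to the uncertain one. The same property explains the cost concern: with a fixed threshold, the total number of active experts depends on the router's output distribution rather than on a set budget.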

Key facts

DTop-p is a sparsity-controllable dynamic routing mechanism for MoE.
It uses a Proportional-Integral controller to learn the Top-p probability threshold.
Dynamic routing normalization supports layer-wise expert selection.
Standard Top-k routing imposes rigid sparsity, ignoring token difficulty.
Naive Top-p with a fixed global threshold provides only marginal gains over Top-k.
The paper is on arXiv with ID 2512.13996.
The method targets foundation model pre-training.
DTop-p addresses the hyperparameter sensitivity and uncontrolled costs of naive Top-p.
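To make the controller idea concrete, the following sketch shows a generic PI loop that nudges a Top-p threshold toward a target sparsity. It is an illustration under stated assumptions, not the paper's algorithm: the class name `PIThresholdController`, the gains `kp` and `ki`, the clamping range, and the choice of "average active experts per token" as the controlled quantity are all hypothetical.

```python
class PIThresholdController:
    """Generic PI controller for a Top-p threshold (illustrative only)."""

    def __init__(self, target, p_init=0.5, kp=0.05, ki=0.01, p_min=0.05, p_max=0.99):
        self.target = target      # desired average number of active experts per token
        self.p_init = p_init
        self.kp, self.ki = kp, ki
        self.p_min, self.p_max = p_min, p_max
        self.integral = 0.0       # accumulated error (no anti-windup in this sketch)
        self.p = p_init

    def update(self, observed):
        """Adjust the threshold from the observed average number of active experts."""
        error = self.target - observed   # positive: too few experts, so raise p
        self.integral += error
        raw = self.p_init + self.kp * error + self.ki * self.integral
        self.p = min(self.p_max, max(self.p_min, raw))
        return self.p


def count_active(probs, p):
    """Number of experts needed for the sorted cumulative probability to reach p."""
    cumulative = 0.0
    for n, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return n
    return len(probs)


# Toy batch of per-token routing probabilities (made-up values).
batch = [
    [0.90, 0.05, 0.03, 0.02],
    [0.40, 0.30, 0.20, 0.10],
    [0.30, 0.28, 0.22, 0.20],
]

controller = PIThresholdController(target=2.0)
for step in range(50):
    avg_active = sum(count_active(t, controller.p) for t in batch) / len(batch)
    controller.update(avg_active)

print("threshold after 50 steps:", round(controller.p, 3))
```

The threshold is never differentiated in this loop; it is steered only by the gap between observed and target sparsity. Because the number of active experts changes in discrete steps, a loop like this may hover around the target rather than settle on it exactly.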
