ARTFEED — Contemporary Art Intelligence

DTop-p MoE: Dynamic Top-p Routing for Foundation Model Pre-training

ai-technology · 2026-06-01

A new paper on arXiv proposes DTop-p, a dynamic routing mechanism for sparse Mixture-of-Experts (MoE) architectures. Standard Top-k routing activates a fixed number of experts per token, ignoring token difficulty and layer-specific needs. Top-p routing instead selects experts adaptively, adding them until their cumulative routing probability reaches a threshold. Existing naive implementations use a single fixed global threshold; they offer only marginal gains, are sensitive to that hyperparameter, and leave compute costs uncontrolled because the number of activated experts is no longer fixed. DTop-p uses a Proportional-Integral controller to learn the Top-p probability threshold per layer, enabling sparsity control and dynamic routing normalization. The method aims to improve efficiency and performance in foundation model pre-training.
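To make the contrast between the two routing rules concrete, here is a minimal sketch in plain Python. It is not taken from the paper: the function names, the example router probabilities, and the threshold value 0.7 are all assumptions chosen for illustration.

```python
from typing import List

def top_k_select(probs: List[float], k: int) -> List[int]:
    """Fixed-size routing: always return the k highest-probability experts."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]

def top_p_select(probs: List[float], p: float) -> List[int]:
    """Return the smallest set of experts whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen: List[int] = []
    cumulative = 0.0
    for i in order:
        chosen.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return chosen

# Hypothetical router outputs over four experts (illustrative values only).
confident = [0.85, 0.05, 0.05, 0.05]  # router is sure: one expert dominates
uncertain = [0.30, 0.28, 0.22, 0.20]  # router is unsure: probability is spread out

print(len(top_k_select(confident, 2)), len(top_k_select(uncertain, 2)))
print(len(top_p_select(confident, 0.7)), len(top_p_select(uncertain, 0.7)))
```

With Top-k, both tokens receive the same number of experts regardless of how the router's probability mass is distributed. With Top-p, the confident token needs fewer experts than the uncertain one, which is also why the total compute is no longer fixed in advance.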

Key facts

  • DTop-p is a sparsity-controllable dynamic routing mechanism for MoE.
  • It uses a Proportional-Integral controller to learn the Top-p probability threshold.
  • Dynamic routing normalization supports layer-wise expert selection.
  • Standard Top-k routing imposes rigid sparsity ignoring token difficulty.
  • Naive Top-p with fixed global thresholds provides marginal gains over Top-k.
  • The paper is on arXiv with ID 2512.13996.
  • The method targets foundation model pre-training.
  • DTop-p addresses hyperparameter sensitivity and uncontrolled costs.
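The idea of steering a per-layer threshold with a Proportional-Integral controller can be sketched as follows. This is a generic textbook PI loop, not the paper's update rule: the class name, the gains `kp` and `ki`, the choice of "average experts per token" as the controlled signal, and the linear toy layer used in place of a real MoE layer are all assumptions.

```python
class PIThresholdController:
    """Generic PI controller that nudges a Top-p threshold toward a sparsity target."""

    def __init__(self, target: float, kp: float = 0.05, ki: float = 0.01,
                 p_init: float = 0.5) -> None:
        self.target = target      # desired average number of active experts per token
        self.kp = kp              # proportional gain (hypothetical value)
        self.ki = ki              # integral gain (hypothetical value)
        self.bias = p_init        # starting threshold
        self.p = p_init
        self.integral = 0.0       # running sum of past errors

    def update(self, observed: float) -> float:
        # Positive error means too few experts are active, so the threshold rises.
        error = self.target - observed
        self.integral += error
        raw = self.bias + self.kp * error + self.ki * self.integral
        self.p = min(max(raw, 0.01), 1.0)  # keep the threshold a valid probability
        return self.p


def toy_layer(p: float) -> float:
    """Hypothetical stand-in for a layer: active experts grow linearly with p."""
    return 8.0 * p


controller = PIThresholdController(target=2.0)
p = controller.p
for step in range(300):
    observed = toy_layer(p)          # measure current sparsity
    p = controller.update(observed)  # adjust the threshold for the next step

print(round(p, 4), round(toy_layer(p), 4))
```

In a multi-layer model, one such controller per layer would let each layer settle on its own threshold while the overall expert budget stays near the target, which is the kind of sparsity control the key facts describe. How the paper measures sparsity and tunes the controller is not specified in this summary.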

Entities

Institutions

  • arXiv
