ARTFEED — Contemporary Art Intelligence

μP Scaling Laws Derived for Grouped-Query Attention in LLMs

ai-technology · 2026-05-18

A new arXiv paper (2605.15290) extends the maximal update parameterization (μP) to grouped-query attention (GQA), an attention variant widely used in large language models. The authors build on the spectral feature-learning framework by promoting its spectral norm conditions from heuristic to definition, and from that definition they derive the Complete-P depth and weight-decay scalings without lazy learning. They also introduce a modified spectral norm that keeps the scaling laws valid for non-full-rank weight matrices, which enables the first derivation of μP scalings for GQA. The authors report empirical results in support of these scalings, which are intended to reduce the compute needed to transfer hyperparameters across model architectures.
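
For context, the spectral condition from the feature-learning view of Yang et al. (2023a) is commonly stated as below. The notation is an assumption made here for illustration and may differ from the paper's: W_ℓ is the weight matrix of layer ℓ, mapping width n_{ℓ-1} to width n_ℓ, ΔW_ℓ is its update after one gradient step, and ‖·‖_* is the spectral norm (largest singular value). The paper's modified norm for non-full-rank matrices is not reproduced here.

```latex
\[
\|W_\ell\|_* = \Theta\!\left(\sqrt{\frac{n_\ell}{n_{\ell-1}}}\right),
\qquad
\|\Delta W_\ell\|_* = \Theta\!\left(\sqrt{\frac{n_\ell}{n_{\ell-1}}}\right).
\]
```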

Key facts

  • Paper: arXiv:2605.15290v1.
  • Focuses on hyperparameter transfer for LLMs using μP.
  • Derives Complete-P depth and weight-decay scalings from spectral norm conditions.
  • Introduces modified spectral norm for non-full-rank matrices.
  • First derivation of μP scalings for grouped-query attention (GQA).
  • Builds on spectral feature-learning view of Yang et al. (2023a).
  • Aims to reduce compute for tuning LLMs across architectures.
  • Demonstrates efficacy with empirical results.
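
To make the GQA setting concrete, here is a minimal NumPy sketch of grouped-query attention, followed by a rank check on the key projection once it is expanded to one copy per query head. The dimensions, variable names, and the "expanded" matrix are illustrative assumptions and are not taken from the paper. The rank check is only one plausible way to see where non-full-rank matrices can arise in GQA, and the sketch uses the standard 1/sqrt(d_head) attention scaling rather than any μP-specific choice.

```python
import numpy as np

def gqa_attention(x, w_q, w_k, w_v, n_q_heads, n_kv_heads):
    """Unmasked grouped-query attention for one sequence x of shape (seq, d_model)."""
    seq, _ = x.shape
    d_head = w_q.shape[1] // n_q_heads
    group = n_q_heads // n_kv_heads  # number of query heads sharing each K/V head

    q = (x @ w_q).reshape(seq, n_q_heads, d_head)
    k = (x @ w_k).reshape(seq, n_kv_heads, d_head)
    v = (x @ w_v).reshape(seq, n_kv_heads, d_head)

    # Share each K/V head across its group of query heads.
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)

    # Scaled dot-product attention per query head, with a stable softmax.
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(d_head)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    out = np.einsum("hqk,khd->qhd", weights, v)
    return out.reshape(seq, n_q_heads * d_head)

# Hypothetical sizes chosen only for illustration.
rng = np.random.default_rng(0)
seq, d_model, n_q_heads, n_kv_heads, d_head = 5, 64, 8, 2, 8

x = rng.standard_normal((seq, d_model))
w_q = rng.standard_normal((d_model, n_q_heads * d_head)) / np.sqrt(d_model)
w_k = rng.standard_normal((d_model, n_kv_heads * d_head)) / np.sqrt(d_model)
w_v = rng.standard_normal((d_model, n_kv_heads * d_head)) / np.sqrt(d_model)

out = gqa_attention(x, w_q, w_k, w_v, n_q_heads, n_kv_heads)

# Key projection of an equivalent multi-head layer: each K head is copied
# once per query head in its group, so the square matrix has repeated columns.
w_k_expanded = np.repeat(
    w_k.reshape(d_model, n_kv_heads, d_head), n_q_heads // n_kv_heads, axis=1
).reshape(d_model, n_q_heads * d_head)
rank = np.linalg.matrix_rank(w_k_expanded)

print(out.shape, w_k_expanded.shape, rank)
```

In this sketch the expanded key projection is a 64 × 64 matrix whose rank is bounded by n_kv_heads × d_head = 16, because its columns repeat across each group of query heads.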

Entities

Institutions

  • arXiv

Sources