μP Scaling Laws Derived for Grouped-Query Attention in LLMs

ai-technology · 2026-05-18

A new arXiv paper (2605.15290) extends the maximal update parameterization (μP) to grouped-query attention (GQA), an attention variant widely used in large language models. The authors build on the spectral feature-learning framework by promoting its spectral norm conditions from heuristic to definition, and from that definition derive the Complete-P depth and weight-decay scalings without lazy learning. They also introduce a modified spectral norm that keeps the scaling laws valid for non-full-rank weight matrices, which enables the first derivation of μP scalings for GQA. The authors report empirical results supporting the approach, which aims to reduce the compute needed for hyperparameter transfer across model architectures.
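
For context, the spectral condition associated with Yang et al. (2023a) is commonly stated as: a layer's weight matrix and its per-step update should both have spectral norm on the order of sqrt(n_out / n_in). The sketch below illustrates only that baseline condition, not the paper's modified norm or its GQA scalings. The function names are hypothetical, and the initialization relies on the standard estimate for the spectral norm of a Gaussian random matrix, which is an assumption of this sketch rather than something taken from the paper.

```python
import math


def target_spectral_norm(n_in: int, n_out: int) -> float:
    """Spectral-norm scale sqrt(n_out / n_in) named by the spectral condition."""
    return math.sqrt(n_out / n_in)


def gaussian_init_std(n_in: int, n_out: int) -> float:
    """Entry std for an i.i.d. Gaussian init whose spectral norm is near the target.

    Assumption: an n_out x n_in Gaussian matrix with entry std sigma has
    spectral norm of roughly sigma * (sqrt(n_in) + sqrt(n_out)).
    """
    return target_spectral_norm(n_in, n_out) / (math.sqrt(n_in) + math.sqrt(n_out))


# Square hidden layers: the target norm stays at 1 as width grows,
# while the entry std shrinks in proportion to 1 / sqrt(width).
for width in (256, 1024, 4096):
    print(width, target_spectral_norm(width, width), gaussian_init_std(width, width))
```

Keeping the target norm fixed while the entry scale shrinks with width is what lets a hyperparameter tuned on a narrow model remain sensible on a wider one, which is the transfer property the paper is concerned with.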

Key facts

Paper: arXiv:2605.15290v1.
Focuses on hyperparameter transfer for LLMs using μP.
Derives Complete-P depth and weight-decay scalings from spectral norm conditions.
Introduces modified spectral norm for non-full-rank matrices.
First derivation of μP scalings for grouped-query attention (GQA).
Builds on spectral feature-learning view of Yang et al. (2023a).
Aims to reduce compute for tuning LLMs across architectures.
Reports empirical results in support of the derived scalings.
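
One way to see how rank deficiency can show up in GQA is sketched below: several query heads share each key/value head, so if the shared key projection is expanded to one block per query head, the expanded matrix contains repeated columns. This is an illustration of GQA head sharing in general, not necessarily the construction used in the paper; the helper names and the small head counts are hypothetical.

```python
from typing import List


def kv_head_for_query_head(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Index of the shared key/value head that a given query head attends with."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


def expanded_kv_column_sources(n_q_heads: int, n_kv_heads: int, head_dim: int) -> List[int]:
    """For each column of the per-query-head (expanded) key projection,
    return the column of the shared key projection that it copies."""
    sources: List[int] = []
    for q_head in range(n_q_heads):
        kv_head = kv_head_for_query_head(q_head, n_q_heads, n_kv_heads)
        sources.extend(range(kv_head * head_dim, (kv_head + 1) * head_dim))
    return sources


sources = expanded_kv_column_sources(n_q_heads=8, n_kv_heads=2, head_dim=4)
print(len(sources))       # width of the expanded projection: 8 * 4 columns
print(len(set(sources)))  # distinct columns: 2 * 4, an upper bound on its rank
```

Because the expanded projection has only n_kv_heads * head_dim distinct columns, its rank cannot exceed that number, which is the kind of non-full-rank structure a scaling rule for GQA has to account for.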
