μP Scaling Laws Derived for Grouped-Query Attention in LLMs
A new arXiv paper (2605.15290) extends the maximal update parameterization (μP) to grouped-query attention (GQA), an attention variant widely used in large language models. The authors advance the spectral feature-learning framework by promoting the spectral norm conditions from a heuristic to a definition, and from that definition they derive the Complete-P depth and weight-decay scalings without lazy learning. They also introduce a modified spectral norm that keeps the scaling laws valid for non-full-rank weight matrices, which enables the first derivation of μP scalings for GQA. The authors report empirical results in support of the scalings, with the aim of reducing the compute needed for hyperparameter transfer across model architectures.
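For readers unfamiliar with the spectral view, the following is a minimal sketch of the condition as it is commonly stated in the spectral feature-learning literature: a weight matrix mapping n_in features to n_out features should have spectral norm on the order of sqrt(n_out / n_in). The helper names (`spectral_norm`, `init_with_spectral_target`) and the rescaling approach are illustrative assumptions, not code from the paper, and the paper's exact definitions may differ.

```python
import numpy as np


def spectral_norm(w: np.ndarray) -> float:
    """Largest singular value of a 2-D matrix."""
    return float(np.linalg.norm(w, ord=2))


def init_with_spectral_target(n_out: int, n_in: int,
                              rng: np.random.Generator) -> np.ndarray:
    """Gaussian init rescaled so that ||W||_* equals sqrt(n_out / n_in).

    Hypothetical helper for illustration; not taken from the paper.
    """
    w = rng.standard_normal((n_out, n_in))
    target = np.sqrt(n_out / n_in)
    return w * (target / spectral_norm(w))


rng = np.random.default_rng(0)
for width in (64, 128, 256):
    n_in, n_out = width, 4 * width          # e.g. an MLP up-projection
    w = init_with_spectral_target(n_out, n_in, rng)
    x = rng.standard_normal(n_in)           # input with roughly unit RMS entries
    y = w @ x
    rms_out = np.sqrt(np.mean(y ** 2))
    print(width, spectral_norm(w), rms_out)
```

Under this scaling the output RMS for a unit-RMS input is bounded by 1 at every width, which is the sense in which the condition keeps feature sizes width-independent.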
Key facts
- Published on arXiv as arXiv:2605.15290v1.
- Focuses on hyperparameter transfer for LLMs using μP.
- Derives Complete-P depth and weight-decay scalings from spectral norm conditions.
- Introduces a modified spectral norm for non-full-rank weight matrices.
- Presents the first derivation of μP scalings for grouped-query attention (GQA).
- Builds on the spectral feature-learning view of Yang et al. (2023a).
- Aims to reduce compute for tuning LLMs across architectures.
- Demonstrates efficacy with empirical results.
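One way to see why GQA might involve non-full-rank matrices, sketched under assumptions: in GQA several query heads share one key/value head, so if the key projection is written out with one copy per query head, the resulting matrix has repeated rows and cannot be full rank. The function `expand_kv_weight` and the chosen dimensions are hypothetical, and the paper's own construction and its modified spectral norm may differ from this picture.

```python
import numpy as np


def expand_kv_weight(w_kv: np.ndarray, n_q_heads: int,
                     n_kv_heads: int, d_head: int) -> np.ndarray:
    """Repeat each KV head's projection once per query head in its group.

    w_kv has shape (n_kv_heads * d_head, d_model). The result has shape
    (n_q_heads * d_head, d_model). Illustrative helper, not from the paper.
    """
    d_model = w_kv.shape[1]
    group = n_q_heads // n_kv_heads
    per_head = w_kv.reshape(n_kv_heads, d_head, d_model)
    repeated = np.repeat(per_head, group, axis=0)
    return repeated.reshape(n_q_heads * d_head, d_model)


rng = np.random.default_rng(0)
d_model, d_head, n_q_heads, n_kv_heads = 64, 8, 8, 2

w_k = rng.standard_normal((n_kv_heads * d_head, d_model))
w_k_expanded = expand_kv_weight(w_k, n_q_heads, n_kv_heads, d_head)

print(w_k_expanded.shape)
print(np.linalg.matrix_rank(w_k_expanded))   # at most n_kv_heads * d_head
print(np.linalg.norm(w_k_expanded, ord=2) / np.linalg.norm(w_k, ord=2))
```

The expanded matrix is square here, yet its rank is capped at n_kv_heads * d_head, and its largest singular value grows by the square root of the group size. A scaling rule that assumes full-rank matrices would not account for either effect.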
Entities
Institutions
- arXiv