Quantifying Hyperparameter Transfer in LLM Training

ai-technology · 2026-05-22

A new arXiv paper (2605.21486) introduces a framework for quantifying hyperparameter transfer in large language model training, with a focus on the embedding-layer learning rate. The authors define three metrics: the quality of the scaling-law fit, robustness to extrapolation errors, and the asymptotic loss penalty incurred by the choice of parameterization. Using ablations, they investigate why Maximal Update parameterization (μP) outperforms standard parameterization (SP) when training with AdamW, and find that the embedding-layer learning rate is a critical factor. The study addresses gaps in existing theory and offers practical guidance for scaling optimization hyperparameters.
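To make the first metric concrete, here is a minimal sketch of what a scaling-law fit for hyperparameter transfer could look like. It is not taken from the paper: it assumes the optimal learning rate follows a power law in model width, fits that law by least squares in log-log space, reports R² as a stand-in for fit quality, and extrapolates to a larger width. The function names and the width/learning-rate values are hypothetical.

```python
import numpy as np


def fit_power_law(scales, optimal_lrs):
    """Fit lr_opt ~ c * scale**b by least squares in log-log space.

    Returns (c, b, r_squared). R^2 is used here as an illustrative
    measure of fit quality; the paper's exact metric may differ.
    """
    x = np.log(np.asarray(scales, dtype=float))
    y = np.log(np.asarray(optimal_lrs, dtype=float))
    b, log_c = np.polyfit(x, y, deg=1)  # slope first, then intercept
    residuals = y - (log_c + b * x)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    r_squared = 1.0 - ss_res / ss_tot
    return float(np.exp(log_c)), float(b), r_squared


def extrapolate(c, b, target_scale):
    """Predict the optimal learning rate at a scale outside the fit range."""
    return c * target_scale ** b


# Hypothetical sweep results: best learning rate found at each small width.
# These values are synthetic and follow an exact 1/width law.
widths = [256, 512, 1024, 2048]
best_lrs = [0.02 * 256 / w for w in widths]

c, b, r2 = fit_power_law(widths, best_lrs)
predicted_lr = extrapolate(c, b, 8192)
print(f"exponent={b:.3f}, R^2={r2:.4f}, predicted lr at width 8192={predicted_lr:.6f}")
```

With real sweep data the points will not lie exactly on a line, and the gap between the extrapolated value and the true optimum at the target scale is what a robustness metric would need to account for.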

Key facts

arXiv paper 2605.21486
Hyperparameter transfer allows extrapolating optimal hyperparameters from small to large scales
Three metrics developed: scaling law fit quality, robustness to extrapolation errors, asymptotic loss penalty
Maximal Update parameterization (μP) compared to standard parameterization (SP)
Training with AdamW optimizer
Embedding layer learning rate identified as critical factor
Comprehensive ablation studies conducted
Existing theory inadequate to explain μP benefits
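The role of the embedding layer is easier to see with a small sketch of how per-layer learning rates are commonly set under the two parameterizations. This is an illustration under stated assumptions, not the paper's configuration: it uses one common formulation of μP for Adam-type optimizers, in which the hidden-matrix learning rate shrinks in proportion to width while the embedding learning rate stays fixed, and it treats SP as using a single learning rate for every group. The function name, group names, and numbers are hypothetical, and other parameter groups (output layer, biases, norms) are omitted.

```python
def learning_rates(width, base_width, base_lr, parameterization):
    """Return illustrative per-group learning rates for a given model width.

    Assumptions (not confirmed by the source):
      - "sp": every parameter group uses the same learning rate.
      - "mup": hidden weight matrices scale as base_width / width,
        while the embedding layer keeps the base learning rate.
    """
    if parameterization == "sp":
        return {"embedding": base_lr, "hidden": base_lr}
    if parameterization == "mup":
        ratio = base_width / width
        return {"embedding": base_lr, "hidden": base_lr * ratio}
    raise ValueError(f"unknown parameterization: {parameterization!r}")


# Hypothetical example: tune at width 256, transfer to width 4096.
sp_rates = learning_rates(4096, 256, 1e-2, "sp")
mup_rates = learning_rates(4096, 256, 1e-2, "mup")
print("SP :", sp_rates)
print("muP:", mup_rates)
```

Under these assumptions the two schemes differ only in how the hidden-layer rate moves with width relative to the embedding rate, which is the kind of per-layer choice an ablation on the embedding-layer learning rate would isolate.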
