ARTFEED — Contemporary Art Intelligence

Quantifying Hyperparameter Transfer in LLM Training

ai-technology · 2026-05-22

A new arXiv paper (2605.21486) introduces a framework for quantifying hyperparameter transfer in large language model training, that is, how reliably optimal hyperparameters found at small scale extrapolate to large scale. The authors define three metrics: the quality of the scaling-law fit, robustness to extrapolation errors, and the asymptotic loss penalty incurred by the choice of parameterization. Through comprehensive ablations, they investigate why Maximal Update parameterization (μP) outperforms standard parameterization (SP) when training with AdamW, and find that the embedding layer learning rate is a critical factor. The study addresses gaps in existing theory, which is inadequate to explain μP's benefits, and offers practical guidance for scaling optimization hyperparameters.
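To make the idea of extrapolating hyperparameters concrete, here is a minimal sketch of fitting a power law to optimal learning rates measured at small scale and extrapolating it to a larger model. It is not the paper's method: the function names, the synthetic sweep data, and the use of log-space R² as a stand-in for "scaling law fit quality" are all assumptions made for illustration.

```python
import math


def fit_power_law(sizes, optimal_lrs):
    """Least-squares fit of log(lr) = log(a) + b * log(size).

    Returns (a, b, r2), where r2 is the coefficient of determination
    in log space, used here as a simple proxy for fit quality.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(lr) for lr in optimal_lrs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    log_a = mean_y - b * mean_x
    ss_res = sum((y - (log_a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1.0 - ss_res / ss_tot
    return math.exp(log_a), b, r2


def extrapolate(a, b, size):
    """Predict the optimal learning rate at a new model size."""
    return a * size ** b


# Hypothetical sweep: synthetic optima that follow an exact power law.
small_sizes = [1e7, 3e7, 1e8, 3e8]
best_lrs = [0.03 * (n / 1e7) ** -0.5 for n in small_sizes]

a, b, r2 = fit_power_law(small_sizes, best_lrs)
predicted_lr = extrapolate(a, b, 1e10)
print(f"exponent={b:.3f}, r2={r2:.4f}, predicted lr at 1e10 params={predicted_lr:.2e}")
```

With real sweep results the fit would be noisy, which is where a robustness-to-extrapolation metric would matter: small errors in the fitted exponent grow with the distance between the largest measured size and the target size.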

Key facts

  • arXiv paper 2605.21486
  • Hyperparameter transfer allows extrapolating optimal hyperparameters from small to large scales
  • Three metrics developed: scaling law fit quality, robustness to extrapolation errors, asymptotic loss penalty
  • Maximal Update parameterization (μP) compared to standard parameterization (SP)
  • Training with AdamW optimizer
  • Embedding layer learning rate identified as critical factor
  • Comprehensive ablation studies conducted
  • Existing theory inadequate to explain μP benefits
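As a rough illustration of why the embedding layer learning rate can behave differently from the rest of the model, the sketch below computes per-group learning rates as width grows. It assumes one frequently cited μP rule for Adam-type optimizers, in which hidden-matrix learning rates shrink in proportion to the width multiplier while the embedding learning rate stays fixed. The summary above does not describe the paper's exact setup, so the group names, the `scaled_learning_rates` helper, and the base values are hypothetical.

```python
def scaled_learning_rates(base_lrs, base_width, width, parameterization):
    """Return per-group learning rates for a model of the given width.

    base_lrs: dict with "embedding" and "hidden" learning rates tuned
              at base_width (hypothetical group names).
    parameterization: "sp" reuses the tuned values unchanged;
                      "mup" applies the assumed width-dependent rule.
    """
    multiplier = width / base_width
    if parameterization == "sp":
        # Standard parameterization: no width-dependent adjustment.
        return dict(base_lrs)
    if parameterization == "mup":
        return {
            # Assumed rule: embedding learning rate is width-independent.
            "embedding": base_lrs["embedding"],
            # Assumed rule: hidden learning rate scales as 1 / width multiplier.
            "hidden": base_lrs["hidden"] / multiplier,
        }
    raise ValueError(f"unknown parameterization: {parameterization}")


# Hypothetical values tuned on a width-256 proxy model.
tuned = {"embedding": 1e-2, "hidden": 1e-3}
lrs = scaled_learning_rates(tuned, base_width=256, width=1024, parameterization="mup")
print(lrs)
```

In a real training script, the resulting values would typically be passed to the optimizer as separate parameter groups, so that the embedding matrix and the hidden matrices are updated with different step sizes.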

Entities

Institutions

  • arXiv
