New Parameterization Enables Learning Rate Transfer in Normalized Transformers
Researchers have developed νGPT, a parameterization for Normalized Transformers (nGPT) under which the optimal learning rate transfers across width, depth, and token horizon. The original nGPT, introduced in arXiv:2410.01131, trains notably faster and needs neither weight decay nor learning rate warmup, but its tuned learning rates do not carry over across model width or token horizon. Guided by numerical experiments and the alignment exponents of arXiv:2407.05872, the team adapted the μP approach to hyperparameter transfer (arXiv:2011.14522) to the normalized setting. Extensive empirical validation confirms that νGPT exhibits learning rate transfer, addressing a key limitation of nGPT.
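nGPT's central mechanism is to keep all weight vectors on the unit hypersphere by renormalizing weight matrices along the embedding dimension after every optimizer step. Below is a minimal PyTorch sketch of that renormalization; the helper name `renormalize_weights` and the restriction to `nn.Linear` layers are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

def renormalize_weights(model: nn.Module) -> None:
    """Project weight matrices back onto the unit hypersphere.

    A sketch of the post-optimizer-step normalization that nGPT
    (arXiv:2410.01131) applies along the embedding dimension.
    """
    with torch.no_grad():
        for module in model.modules():
            if isinstance(module, nn.Linear):
                # nn.Linear stores weight as (out_features, in_features);
                # normalizing along dim=1 gives each row unit L2 norm over
                # the embedding dimension.
                module.weight.div_(module.weight.norm(dim=1, keepdim=True))
```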
Key facts
- νGPT is a new parameterization for Normalized Transformers (nGPT).
- nGPT was introduced in arXiv:2410.01131.
- nGPT achieves training speedups without weight decay or learning rate warmup.
- nGPT did not exhibit learning rate transfer across model dimension and token horizon.
- The research combines numerical experiments with alignment exponents (arXiv:2407.05872).
- The μP approach to hyperparameter transfer (arXiv:2011.14522) was modified; see the learning-rate scaling sketch after this list.
- νGPT exhibits learning rate transfer across width, depth, and token horizon.
- Extensive empirical validation supports the findings.
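The μP recipe that νGPT builds on prescribes width-dependent learning rates: with Adam, matrix-like (hidden) parameters get a learning rate scaled in proportion to 1/width relative to a base model, so a rate tuned on a small proxy transfers to larger widths. A hedged sketch in PyTorch, assuming a simple ndim-based split between matrix-like and vector-like parameters (the function name `mup_param_groups` and the grouping heuristic are illustrative, not the paper's exact scheme):

```python
import torch
import torch.nn as nn

def mup_param_groups(model: nn.Module, base_lr: float,
                     base_width: int, width: int) -> list[dict]:
    """Build Adam parameter groups with muP-style learning-rate scaling.

    Under the muP prescription (arXiv:2011.14522), hidden weight matrices
    trained with Adam use a learning rate scaled by base_width / width so
    the optimum found at base_width carries over to larger models.
    """
    matrix_params = [p for p in model.parameters() if p.ndim >= 2]
    vector_params = [p for p in model.parameters() if p.ndim < 2]
    return [
        {"params": matrix_params, "lr": base_lr * base_width / width},
        {"params": vector_params, "lr": base_lr},  # biases, norms: unscaled
    ]

# Usage: tune base_lr on a small proxy model, then reuse it at larger width.
# optimizer = torch.optim.Adam(mup_param_groups(model, 3e-4, 256, 2048))
```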