Spectral Shaping Improves Muon Optimizer for LLM Training
A new arXiv paper introduces DynMuon, a variant of the Muon optimizer that applies spectral shaping to the update matrix. Given the singular value decomposition M = UΣV^T of the update matrix, standard Muon replaces M with its polar factor UV^T, which sets every nonzero singular value to 1. DynMuon generalizes this to UΣ^p V^T, so p = 0 recovers standard Muon and p = 1 leaves M unchanged. The exponent p is adjusted based on local curvature, stochastic gradient noise, and training stage. According to the paper's theory and experiments, positive p accelerates early training by emphasizing high-curvature directions, while mildly negative p helps in later stages by shifting weight toward low-curvature directions. The authors describe this stage-dependent behavior as previously overlooked and present it as a dynamic way to improve convergence in large language model training.
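The map from M to UΣ^p V^T can be sketched in a few lines of NumPy. This is only an illustration of the formula quoted above, not the paper's implementation: the function name `spectral_shape`, the `eps` clamp, and the use of an explicit SVD are assumptions of this sketch. Practical Muon implementations approximate the polar factor with Newton-Schulz iterations instead of an SVD, and the summary does not say how DynMuon computes its shaped update.

```python
import numpy as np

def spectral_shape(M, p, eps=1e-8):
    """Return U @ diag(sigma**p) @ V^T for the thin SVD M = U diag(sigma) V^T.

    Hypothetical helper for illustration; eps is an assumed safeguard.
    """
    U, sigma, Vt = np.linalg.svd(M, full_matrices=False)
    # Clamp tiny singular values so a negative p does not divide by ~0.
    shaped = np.maximum(sigma, eps) ** p
    # Scaling the columns of U by `shaped` equals U @ diag(shaped).
    return (U * shaped) @ Vt

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 3))      # stand-in for an update matrix

polar = spectral_shape(M, p=0.0)     # p = 0: polar factor U V^T (standard Muon)
raw = spectral_shape(M, p=1.0)       # p = 1: the original matrix M
half = spectral_shape(M, p=0.5)      # intermediate shaping
```

With p = 0 the result has orthonormal columns, with p = 1 it reproduces M, and values in between interpolate the singular values between those two extremes.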
Key facts
- Muon is a prominent optimizer for training large language models.
- Standard Muon replaces the update matrix with its polar factor UV^T.
- DynMuon uses UΣ^p V^T for spectral shaping.
- The exponent p is adjusted according to local curvature, stochastic gradient noise, and training stage.
- Positive p helps early training by emphasizing high-curvature directions.
- Mildly negative p helps later training by focusing on low-curvature directions.
- The paper is arXiv:2605.17109.
- The work reports a previously overlooked behavior in Muon-like updates: the preferred sign of p differs between early and late training.
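The effect of the sign of p on UΣ^p V^T can be checked with a small numerical sketch. The singular values and the exponents 0.5 and -0.5 below are made up for illustration, and `shaped_weights` is a hypothetical helper. The sketch only shows how σ^p redistributes relative weight across singular directions. It does not model curvature or noise, and it does not reproduce the paper's rule for choosing p.

```python
import numpy as np

def shaped_weights(sigma, p):
    """Relative weight each singular direction receives after sigma -> sigma**p."""
    shaped = np.asarray(sigma, dtype=float) ** p
    return shaped / shaped.sum()

sigma = [4.0, 1.0]                    # made-up singular values, largest first

early = shaped_weights(sigma, 0.5)    # positive p: favors the large-sigma direction
muon = shaped_weights(sigma, 0.0)     # p = 0: all directions weighted equally
late = shaped_weights(sigma, -0.5)    # negative p: favors the small-sigma direction
```

Here positive p gives the larger singular direction two thirds of the weight, p = 0 splits it evenly, and negative p reverses the ratio. This matches the bullets above if, as the summary implies, large singular values correspond to high-curvature directions.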
Entities
Institutions
- arXiv