Muon Optimizer's Spectral Flattening Mechanism Explained
A recent study published on arXiv (2605.13079) attributes the effectiveness of the Muon optimizer to spectral flattening. Using Newton-Schulz iterations, Muon orthogonalizes its momentum buffer, replacing all singular values of the update with ones. This lets Muon tolerate larger learning rates and converge faster than traditional optimizers. The researchers show that Muon's maximal stable step size scales with the average singular value of the gradient, in contrast to SGD, which is limited by the largest singular value. They also recast Muon as a preconditioned gradient method and show improved convergence under a Kronecker-factored curvature model, with the size of the improvement controlled by the spectrum of the gradient covariance. Experiments confirm that Muon remains stable at learning rates that cause SGD to diverge within the first few iterations, reaching accuracy milestones several epochs sooner.
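The orthogonalization step can be sketched with the classical cubic Newton-Schulz iteration. This is a simplification for illustration: the actual Muon implementation uses a tuned quintic polynomial so that a handful of iterations suffice, but the fixed point (all singular values driven to one) is the same.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=20):
    """Classical cubic Newton-Schulz iteration X <- 1.5*X - 0.5*X @ X.T @ X.
    Converges to the orthogonal polar factor of G (all singular values -> 1)
    provided the starting singular values lie in (0, sqrt(3)). Muon itself
    uses a tuned quintic variant; this cubic form is the simplest sketch."""
    X = G / np.linalg.norm(G)      # Frobenius normalization: spectrum in (0, 1]
    transposed = X.shape[0] > X.shape[1]
    if transposed:                 # keep the Gram matrix X @ X.T small
        X = X.T
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X.T if transposed else X

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 6))    # stand-in for a momentum/gradient matrix
O = newton_schulz_orthogonalize(G)
svals = np.linalg.svd(O, compute_uv=False)
print(svals)                       # every singular value flattened to ~1.0
```

Because the iteration uses only matrix multiplies, it runs efficiently on accelerators, which is part of why Muon favors it over an explicit SVD.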
Key facts
- Muon orthogonalizes its momentum buffer before each update using Newton-Schulz iterations.
- Spectral flattening is the mechanism behind Muon's performance.
- Muon's maximal stable step size scales with the average singular value of the gradient.
- Standard gradient descent is bottlenecked by the largest singular value.
- Muon is recast as a preconditioned gradient method.
- Improvement is controlled by spectrum of gradient covariance.
- Muon remains stable at learning rates that cause SGD to diverge early.
- Muon reaches accuracy milestones several epochs earlier than SGD.
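The step-size claims above can be illustrated on a toy ill-conditioned quadratic (my own sketch, not an experiment from the paper; the spectrum `lam`, the error matrices, and the `loss` helper are all hypothetical, and for clarity the orthogonalization uses an exact SVD polar factor rather than Newton-Schulz):

```python
import numpy as np

# Toy quadratic: loss(W) = 0.5 * ||A (W - W*)||_F^2 with A = diag(sqrt(lam)),
# so the gradient in the error E = W - W* is diag(lam) @ E. Plain gradient
# descent is stable only for lr < 2 / max(lam), while the orthogonalized
# update has spectral norm 1 regardless of the largest eigenvalue.
rng = np.random.default_rng(1)
lam = np.array([100.0, 9.0, 1.0, 0.09])   # hypothetical curvature spectrum
E_gd = rng.standard_normal((4, 4))        # initial error, shared by both runs
E_mu = E_gd.copy()
lr = 0.05                                 # above 2/100, so GD must diverge

def loss(E):
    return 0.5 * np.sum(lam[:, None] * E**2)

for _ in range(50):
    E_gd = E_gd - lr * (lam[:, None] * E_gd)     # plain gradient step
    G = lam[:, None] * E_mu
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    E_mu = E_mu - lr * (U @ Vt)                  # spectrally flat step

print(f"GD loss:        {loss(E_gd):.3e}")   # explodes
print(f"Muon-like loss: {loss(E_mu):.3e}")   # stays small
```

At this learning rate the top curvature direction multiplies the GD error by |1 - 0.05 * 100| = 4 every step, while the orthogonalized update moves every singular direction by the same bounded amount, matching the paper's claim that Muon's stable step size is set by the average rather than the largest singular value.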
Entities
Institutions
- arXiv