New Adaptive Optimization Method Bridges SGD and Muon
A new paper on arXiv introduces a data-driven criterion for dynamically selecting the update geometry used when training deep neural networks. The framework recovers existing optimizers such as SGD, Muon, Adam, and MuAdam as special cases. Its closed-form criterion is derived from gradient and activation statistics through a single-step random feature regression surrogate model. The authors pair the criterion with efficient computational strategies so that it scales, and the adaptive approach could improve training dynamics across diverse architectures.
Key facts
- Paper arXiv:2605.19781 introduces adaptive optimization via Schatten-p norms.
- Method dynamically chooses a proxy-optimal linear minimization oracle (LMO) geometry for each layer.
- Criterion derived from gradient and activation statistics using random feature regression.
- Unifies SGD, Muon, Adam, and MuAdam as special cases (specific extrema of the framework).
- Scalable via efficient computational strategies.
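To make the Schatten-p idea in the key facts concrete, here is a minimal sketch of how an LMO over a unit Schatten-p ball reshapes a gradient's singular values. It relies on the standard duality result for Schatten norms and is not the paper's selection criterion, which is not reproduced in this summary. The function name `schatten_lmo_weights` and its arguments are hypothetical, and the sketch assumes the singular values of the gradient G = U diag(s) V^T have already been computed.

```python
import math


def schatten_lmo_weights(singular_values, p):
    """Singular values w of the LMO solution over the unit Schatten-p ball (p > 1).

    For a gradient G = U diag(s) V^T, the minimizer of <G, X> subject to
    ||X||_{S_p} <= 1 is X = -U diag(w) V^T. This function returns w only;
    building X from U and V is left out of the sketch.
    """
    if p <= 1:
        raise ValueError("this sketch only covers p > 1")

    top = max(singular_values, default=0.0)
    if top == 0.0:
        # Zero gradient: no preferred direction.
        return [0.0 for _ in singular_values]

    if math.isinf(p):
        # Spectral-norm ball: every nonzero singular direction gets weight 1,
        # which corresponds to a Muon-style orthogonalized update.
        return [1.0 if s > 0 else 0.0 for s in singular_values]

    # Dual exponent q satisfies 1/p + 1/q = 1.
    q = p / (p - 1.0)

    # Rescale by the largest singular value for numerical stability;
    # the normalization below makes the result independent of this scale.
    scaled = [s / top for s in singular_values]
    powered = [t ** (q - 1.0) for t in scaled]
    norm = sum(t ** q for t in scaled) ** ((q - 1.0) / q)
    return [w / norm for w in powered]


if __name__ == "__main__":
    s = [3.0, 4.0]
    for p in (2.0, 4.0, 16.0, math.inf):
        print(p, schatten_lmo_weights(s, p))
```

With p = 2 the weights are the singular values divided by the Frobenius norm, so the update is a normalized SGD step. As p grows, the weights flatten toward 1, approaching the orthogonalized update associated with Muon. How Adam and MuAdam fit into the framework is not covered by this sketch.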
Entities
Institutions
- arXiv