MiMuon Optimizer Enhances Generalization for Large AI Models
A new optimizer called MiMuon (Mixed Muon) has been proposed to improve the generalization of large-scale artificial intelligence models. The Muon optimizer, designed for matrix-structured parameters, converges faster than vector-wise algorithms, but its generalization properties had not previously been established. Using algorithmic stability and mathematical induction, the paper proves that Muon has a generalization error of O(1/(Nκ^T)), where N is the training sample size, T is the number of iterations, and κ is the minimum difference between the singular values of the gradient estimate. To improve on this, the authors introduce MiMuon, a mixed variant of Muon. The work is published on arXiv under identifier 2605.19619.
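The summary above does not spell out the Muon update itself. As a rough illustration of what a matrix-structured step can look like, the sketch below follows the way Muon is commonly described: the momentum matrix is replaced by its orthogonalized form U V^T before being applied. It is not the paper's algorithm and not MiMuon. The function names, the learning rate, the momentum coefficient, and the use of an exact SVD (practical implementations typically use an approximate iteration instead) are all assumptions made for illustration.

```python
import numpy as np


def orthogonalize(m):
    """Return U @ Vt from the thin SVD of m, i.e. m with its singular values set to 1."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt


def muon_like_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One illustrative Muon-style step on a single weight matrix.

    lr and beta are hypothetical defaults, not values from the paper.
    """
    momentum = beta * momentum + grad   # accumulate momentum on the whole matrix
    update = orthogonalize(momentum)    # matrix-wise step, not element-wise scaling
    return weight - lr * update, momentum


# Toy usage on a random 4x3 weight matrix.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 3))
g = rng.standard_normal((4, 3))
m = np.zeros_like(w)
w_new, m_new = muon_like_step(w, g, m)
applied_update = (w - w_new) / 0.02     # recover the orthogonalized update
```

The contrast with vector-wise methods is that the update depends on the singular structure of the whole matrix rather than on each entry independently.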
Key facts
- MiMuon is a mixed Muon optimizer for large models.
- The Muon optimizer converges faster than vector-wise algorithms.
- The generalization error of Muon is O(1/(Nκ^T)).
- N is the training sample size and T is the number of iterations.
- κ is the minimum difference between the singular values of the gradient estimate.
- The paper proves generalization properties using algorithmic stability and mathematical induction.
- MiMuon aims to improve generalization of Muon.
- Published on arXiv with ID 2605.19619.
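To make the quantities in the bound concrete, here is a minimal sketch, assuming κ is read as the smallest gap between the sorted singular values of a single gradient-estimate matrix. The helper names and the toy matrix are hypothetical, and the constants hidden by the O(·) notation are ignored, so the second function only shows how the stated rate scales with N, κ, and T.

```python
import numpy as np


def min_singular_gap(grad_estimate):
    """Smallest difference between singular values (needs at least two of them)."""
    s = np.linalg.svd(grad_estimate, compute_uv=False)  # returned in descending order
    return float(np.min(np.abs(np.diff(s))))


def bound_scale(n_samples, kappa, n_iters):
    """The quantity 1 / (N * kappa^T) inside the O(.) bound, constants omitted."""
    return 1.0 / (n_samples * kappa ** n_iters)


# Toy gradient estimate with singular values 3, 2.5 and 1, so the smallest gap is about 0.5.
g = np.diag([3.0, 2.5, 1.0])
kappa = min_singular_gap(g)
scale_small_gap = bound_scale(1000, kappa, 3)
scale_large_gap = bound_scale(1000, 2.0, 3)
```

Under this reading, the expression shrinks as N grows, while its dependence on T hinges on whether κ is below or above 1: a gap smaller than 1 makes the expression grow with more iterations, and a gap larger than 1 makes it shrink.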
Entities
Institutions
- arXiv