Stochastic Sharpness Gap in SGD Training of Neural Networks
A new arXiv preprint (2604.21016) investigates why mini-batch stochastic gradient descent (SGD) stabilizes sharpness below the 2/η threshold observed in full-batch gradient descent (GD). In GD, sharpness (the top eigenvalue of the training-loss Hessian) rises to 2/η and hovers there, a phenomenon known as the Edge of Stability (EoS), explained by a self-stabilization mechanism driven by third-order loss structure (Damian et al., 2023). In SGD, sharpness stabilizes below 2/η, with the gap widening as batch size decreases, and this gap previously lacked a theoretical explanation. The authors introduce stochastic self-stabilization, which extends the self-stabilization framework to SGD. Their key insight is that gradient noise injects variance into the oscillatory dynamics along the top Hessian eigenvector, strengthening the self-stabilization effect and suppressing sharpness. The paper thereby gives the stochastic sharpness gap a theoretical foundation, linking batch size to the degree of sharpness suppression.
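As a rough way to observe the reported behavior in practice, the sketch below estimates the top Hessian eigenvalue of the full-batch loss during training via power iteration on Hessian-vector products; switching the batch size between the full dataset and a small mini-batch switches between GD and SGD. Everything in it (the toy data, the small MLP, the learning rate, the batch size, and the `sharpness` helper) is an illustrative assumption, not code or an experimental setup from the preprint.

```python
# Hedged sketch: track sharpness (top Hessian eigenvalue of the full-batch loss)
# during SGD training. Power iteration converges to the largest-magnitude
# eigenvalue, which in this regime is assumed to be the top (positive) one.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy regression problem and a small MLP -- illustrative choices only.
X = torch.randn(512, 10)
y = torch.randn(512, 1)
model = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()
params = [p for p in model.parameters() if p.requires_grad]


def sharpness(n_iters=20):
    """Estimate the top Hessian eigenvalue of the full-batch loss by power iteration."""
    loss = loss_fn(model(X), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    lam = 0.0
    for _ in range(n_iters):
        norm = torch.sqrt(sum((u * u).sum() for u in v))
        v = [u / norm for u in v]
        # Hessian-vector product: differentiate (grad . v) a second time.
        gv = sum((g * u).sum() for g, u in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        lam = sum((h * u).sum() for h, u in zip(hv, v)).item()  # Rayleigh quotient
        v = [h.detach() for h in hv]
    return lam


eta = 0.02
batch_size = 64  # set batch_size = len(X) to recover full-batch GD
opt = torch.optim.SGD(model.parameters(), lr=eta)

for step in range(2001):
    idx = torch.randint(0, len(X), (batch_size,))
    opt.zero_grad()
    loss_fn(model(X[idx]), y[idx]).backward()
    opt.step()
    if step % 200 == 0:
        # GD is reported to hover near 2/eta (Edge of Stability); the preprint's
        # claim is that SGD settles below it, more so as batch_size shrinks.
        print(f"step {step:4d}  sharpness ~ {sharpness():.1f}  (2/eta = {2 / eta:.1f})")
```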
Key facts
- arXiv:2604.21016
- Full-batch GD sharpness rises to 2/η (Edge of Stability)
- Damian et al. (2023) explained EoS via self-stabilization from third-order loss structure
- SGD sharpness stabilizes below 2/η
- Sharpness gap widens as batch size decreases
- No prior theoretical explanation for SGD sharpness suppression
- Authors propose stochastic self-stabilization
- Gradient noise injects variance into top Hessian eigenvector dynamics (illustrated by the toy sketch after this list)
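The variance-injection point in the last item can be seen in a one-dimensional, linearized caricature of the oscillation along the top Hessian eigenvector. This is not the paper's model; the step size, sharpness value, and noise scales below are made-up numbers. When sharpness λ sits below 2/η, the deterministic update x ← (1 − ηλ)x is contracting and the oscillation dies out, whereas gradient noise sustains it at a predictable amplitude.

```python
# One-dimensional caricature (not the paper's model): displacement x along the
# top Hessian eigenvector follows x <- (1 - eta*lam) * x + noise at each step.
import numpy as np

eta = 0.01           # step size (illustrative)
lam = 190.0          # sharpness just below 2/eta = 200 (illustrative)
a = 1.0 - eta * lam  # per-step multiplier, here -0.9 (contracting oscillation)


def rms_amplitude(noise_std, steps=100_000, seed=0):
    """Root-mean-square oscillation amplitude after a burn-in period."""
    rng = np.random.default_rng(seed)
    x, xs = 1.0, []
    for _ in range(steps):
        x = a * x + noise_std * rng.standard_normal()
        xs.append(x)
    return float(np.sqrt(np.mean(np.square(xs[steps // 2:]))))


for sigma in [0.0, 0.01, 0.03]:  # larger sigma ~ smaller batch size
    predicted = sigma / np.sqrt(1.0 - a * a)  # stationary std of the AR(1) process
    print(f"noise_std={sigma:.2f}  empirical rms={rms_amplitude(sigma):.4f}  "
          f"predicted={predicted:.4f}")
```

With zero noise the amplitude decays to zero; with noise it settles near σ/√(1 − (1 − ηλ)²). In the preprint's stochastic self-stabilization picture, it is this sustained oscillation variance that strengthens the self-stabilization effect and keeps sharpness below 2/η, increasingly so at smaller batch sizes.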