Stochastic Sharpness Gap in SGD Training of Neural Networks
A new arXiv preprint (2604.21016) investigates why mini-batch stochastic gradient descent (SGD) stabilizes sharpness below the 2/η threshold observed in full-batch gradient descent (GD). In GD, sharpness (the top eigenvalue of the training-loss Hessian) rises to 2/η and hovers there, a phenomenon known as the Edge of Stability (EoS), explained by a self-stabilization mechanism driven by third-order loss structure (Damian et al., 2023). In SGD, sharpness stabilizes below 2/η, with the gap widening as batch size decreases, and this gap previously lacked a theoretical explanation. The authors introduce stochastic self-stabilization, which extends the self-stabilization framework to SGD. Their key insight is that gradient noise injects variance into the oscillatory dynamics along the top Hessian eigenvector, strengthening the self-stabilization effect and suppressing sharpness. The paper thereby gives the stochastic sharpness gap a theoretical foundation, linking batch size to the degree of sharpness suppression.
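As a rough way to observe the reported behavior in practice, the sketch below estimates the top Hessian eigenvalue of the full-batch loss during training via power iteration on Hessian-vector products; switching the batch size between the full dataset and a small mini-batch switches between GD and SGD. Everything in it (the toy data, the small MLP, the learning rate, the batch size, and the `sharpness` helper) is an illustrative assumption, not code or an experimental setup from the preprint.

```python
# Hedged sketch: track sharpness (top Hessian eigenvalue of the full-batch loss)
# during SGD training. Power iteration converges to the largest-magnitude
# eigenvalue, which in this regime is assumed to be the top (positive) one.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy regression problem and a small MLP -- illustrative choices only.
X = torch.randn(512, 10)
y = torch.randn(512, 1)
model = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()
params = [p for p in model.parameters() if p.requires_grad]


def sharpness(n_iters=20):
    """Estimate the top Hessian eigenvalue of the full-batch loss by power iteration."""
    loss = loss_fn(model(X), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    lam = 0.0
    for _ in range(n_iters):
        norm = torch.sqrt(sum((u * u).sum() for u in v))
        v = [u / norm for u in v]
        # Hessian-vector product: differentiate (grad . v) a second time.
        gv = sum((g * u).sum() for g, u in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        lam = sum((h * u).sum() for h, u in zip(hv, v)).item()  # Rayleigh quotient
        v = [h.detach() for h in hv]
    return lam


eta = 0.02
batch_size = 64  # set batch_size = len(X) to recover full-batch GD
opt = torch.optim.SGD(model.parameters(), lr=eta)

for step in range(2001):
    idx = torch.randint(0, len(X), (batch_size,))
    opt.zero_grad()
    loss_fn(model(X[idx]), y[idx]).backward()
    opt.step()
    if step % 200 == 0:
        # GD is reported to hover near 2/eta (Edge of Stability); the preprint's
        # claim is that SGD settles below it, more so as batch_size shrinks.
        print(f"step {step:4d}  sharpness ~ {sharpness():.1f}  (2/eta = {2 / eta:.1f})")
```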
Key facts
- arXiv:2604.21016
- Full-batch GD sharpness rises to 2/η (Edge of Stability)
- Damian et al. (2023) explained EoS via self-stabilization from third-order loss structure
- SGD sharpness stabilizes below 2/η
- Sharpness gap widens as batch size decreases
- No prior theoretical explanation for SGD sharpness suppression
- Authors propose stochastic self-stabilization
- Gradient noise injects variance into top Hessian eigenvector dynamics (illustrated by the toy sketch after this list)
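The variance-injection point in the last item can be seen in a one-dimensional, linearized caricature of the oscillation along the top Hessian eigenvector. This is not the paper's model; the step size, sharpness value, and noise scales below are made-up numbers. When sharpness λ sits below 2/η, the deterministic update x ← (1 − ηλ)x is contracting and the oscillation dies out, whereas gradient noise sustains it at a predictable amplitude.

```python
# One-dimensional caricature (not the paper's model): displacement x along the
# top Hessian eigenvector follows x <- (1 - eta*lam) * x + noise at each step.
import numpy as np

eta = 0.01           # step size (illustrative)
lam = 190.0          # sharpness just below 2/eta = 200 (illustrative)
a = 1.0 - eta * lam  # per-step multiplier, here -0.9 (contracting oscillation)


def rms_amplitude(noise_std, steps=100_000, seed=0):
    """Root-mean-square oscillation amplitude after a burn-in period."""
    rng = np.random.default_rng(seed)
    x, xs = 1.0, []
    for _ in range(steps):
        x = a * x + noise_std * rng.standard_normal()
        xs.append(x)
    return float(np.sqrt(np.mean(np.square(xs[steps // 2:]))))


for sigma in [0.0, 0.01, 0.03]:  # larger sigma ~ smaller batch size
    predicted = sigma / np.sqrt(1.0 - a * a)  # stationary std of the AR(1) process
    print(f"noise_std={sigma:.2f}  empirical rms={rms_amplitude(sigma):.4f}  "
          f"predicted={predicted:.4f}")
```

With zero noise the amplitude decays to zero; with noise it settles near σ/√(1 − (1 − ηλ)²). In the preprint's stochastic self-stabilization picture, it is this sustained oscillation variance that strengthens the self-stabilization effect and keeps sharpness below 2/η, increasingly so at smaller batch sizes.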