Geometric Continuity in Deep Neural Networks Explained
A recent investigation posted on arXiv (2605.04971) examines why the weight matrices of deep networks exhibit geometric continuity: the principal singular vectors of adjacent layers point in similar directions. Through experiments on toy MLPs and small transformers, the authors identify two mechanisms: residual connections induce gradient coherence across layers, which aligns weight updates, and symmetry-breaking nonlinearities pin every layer to a shared coordinate frame, preventing rotational drift between layers. A control experiment underscores the second point: an activation that is nonlinear but rotation-preserving fails to retain continuity, so symmetry breaking, not nonlinearity per se, is the essential ingredient. The two mechanisms also shape continuity differently: the activation concentrates it in the leading singular direction, while normalization spreads it across multiple directions.
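One plausible way to quantify the alignment described above is to compare the top left-singular vector of one layer's weight matrix (its dominant output direction) with the top right-singular vector of the next layer's matrix (its dominant input direction). The sketch below is illustrative only; the function name, shapes, and the exact pairing of singular vectors are assumptions, not the paper's protocol.

```python
import numpy as np

def principal_alignment(w_prev, w_next):
    """|cos| between the top left-singular vector of w_prev (its dominant
    output direction) and the top right-singular vector of w_next (its
    dominant input direction). Absolute value is taken because singular
    vectors are defined only up to sign."""
    # SVD convention: w = U @ diag(s) @ Vt, with columns of U the
    # left-singular vectors and rows of Vt the right-singular vectors.
    u_prev, _, _ = np.linalg.svd(w_prev)
    _, _, vt_next = np.linalg.svd(w_next)
    return abs(u_prev[:, 0] @ vt_next[0])

# Toy check with random 64x64 matrices (hypothetical shapes):
# unrelated random weights should show little alignment, while a layer
# followed by its own transpose is perfectly aligned.
rng = np.random.default_rng(0)
w1, w2 = rng.standard_normal((64, 64)), rng.standard_normal((64, 64))
print(principal_alignment(w1, w2))    # small for unrelated random weights
print(principal_alignment(w1, w1.T))  # ~1.0: perfectly aligned pair
```

Tracking this quantity for every adjacent pair of layers over training would give a per-depth continuity profile of the kind the summary describes.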
Key facts
- Weight matrices in deep networks show geometric continuity: principal singular vectors of adjacent layers point in similar directions.
- The origin of this property was previously unexplained.
- Experiments were conducted on toy MLPs and small transformers.
- Residual connections create cross-layer gradient coherence that aligns weight updates.
- Symmetry-breaking nonlinearities constrain all layers to a shared coordinate frame.
- A nonlinear but rotation-preserving activation fails to retain continuity.
- Activation concentrates continuity in the leading singular direction.
- Normalization distributes continuity across multiple directions.
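The distinction between symmetry-breaking and rotation-preserving nonlinearities can be made concrete. A coordinatewise activation like ReLU privileges the coordinate axes, while a radial activation that only rescales a vector's norm commutes with any orthogonal transform. The sketch below is a minimal illustration of that difference; the radial activation is a generic example, not the specific one used in the paper.

```python
import numpy as np

def radial(x):
    """Rotation-preserving nonlinearity: rescales the norm, keeps the
    direction, so f(Qx) = Q f(x) for any orthogonal Q."""
    n = np.linalg.norm(x)
    return x * np.tanh(n) / n if n > 0 else x

def relu(x):
    """Coordinatewise ReLU: privileges the coordinate axes."""
    return np.maximum(x, 0.0)

rng = np.random.default_rng(1)
x = rng.standard_normal(8)
# Random orthogonal transform via QR of a Gaussian matrix.
q, _ = np.linalg.qr(rng.standard_normal((8, 8)))

print(np.allclose(radial(q @ x), q @ radial(x)))  # True: commutes with q
print(np.allclose(relu(q @ x), q @ relu(x)))      # generically False
```

Because the radial activation treats every orthonormal basis identically, it gives the layers nothing to anchor a shared coordinate frame to, which is consistent with the paper's finding that such an activation fails to retain continuity.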