ARTFEED — Contemporary Art Intelligence

Transformer Training Reveals Transient Compression Waves and Persistent Spectral Gradients

other · 2026-04-29

A new study on arXiv (2604.22778) presents the first systematic analysis of weight-matrix singular value spectra during transformer pretraining. Tracking full singular value decompositions every 25 steps across models from 30M to 285M parameters, the researchers identified three phenomena: transient compression waves, in which stable-rank compression travels from early to late layers and then reverses; persistent spectral gradients that form a non-monotonic inverted-U shape across depth in deeper models; and a Q/K-V functional asymmetry in which value/output projections compress uniformly while query/key projections carry the depth-dependent dynamics. The dissociation between the transient compression waves and the persistent gradients indicates that the two are separate aspects of transformer learning dynamics.
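The stable rank tracked here has a standard definition in terms of a matrix's singular values: srank(W) = ||W||_F^2 / ||W||_2^2, the sum of squared singular values divided by the largest squared singular value. The sketch below is not the paper's code; it is a minimal illustration of how such a per-layer metric can be computed from a full SVD, and the helper name and the commented measurement loop are hypothetical.

    import numpy as np

    def stable_rank(weight: np.ndarray) -> float:
        """Stable rank of a 2-D weight matrix: srank(W) = ||W||_F^2 / ||W||_2^2."""
        sigma = np.linalg.svd(weight, compute_uv=False)   # singular values, largest first
        return float(np.sum(sigma ** 2) / sigma[0] ** 2)

    # Hypothetical measurement loop in the spirit of the study's setup:
    # every 25 steps, record the stable rank of each layer's projections.
    # `layers` and `get_projection` stand in for a real model's accessors.
    # for step in range(0, max_steps, 25):
    #     for depth, layer in enumerate(layers):
    #         for name in ("query", "key", "value", "output"):
    #             log(step, depth, name, stable_rank(get_projection(layer, name)))

Applied per projection type, the same metric separates the uniform compression reported for value/output projections from the depth-dependent behaviour of query/key projections.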

Key facts

  • First systematic study of weight matrix singular value spectra during transformer pretraining
  • Full singular value decompositions tracked at 25-step intervals
  • Models scaled from 30M to 285M parameters
  • Transient compression waves propagate from early to late layers
  • Compression gradient peaks early then reverses
  • Late layers eventually become more compressed than early layers
  • Power-law exponent α develops a permanent depth gradient (see the sketch after this list)
  • Spectral gradient forms an inverted-U across depth in deeper models, with the peak shifting toward earlier layers
  • Value/output projections compress uniformly
  • Query/key projections carry full depth-dependent dynamics
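
This summary does not describe how the power-law exponent α is fit to each layer's spectrum; the sketch below assumes one common proxy, a least-squares line fit to the singular values on log-log (rank vs. magnitude) axes, reporting the decay slope as α. The function name is illustrative.

    import numpy as np

    def spectral_alpha(weight: np.ndarray) -> float:
        """Decay exponent of the singular-value spectrum from a log-log linear fit."""
        sigma = np.linalg.svd(weight, compute_uv=False)   # descending singular values
        sigma = sigma[sigma > 0]                          # guard the logarithm
        ranks = np.arange(1, sigma.size + 1)
        slope, _intercept = np.polyfit(np.log(ranks), np.log(sigma), 1)
        return float(-slope)                              # report decay as a positive exponent

Computed per layer and per checkpoint, a permanent depth gradient in this quantity would correspond to the persistent spectral gradient described above.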

Entities

Institutions

  • arXiv

Sources

  • arXiv:2604.22778