ARTFEED — Contemporary Art Intelligence

Transformer Training Reveals Transient Compression Waves and Persistent Spectral Gradients

other · 2026-04-29

A new study on arXiv (2604.22778) presents the first systematic analysis of weight-matrix singular value spectra during transformer pretraining. Tracking full singular value decompositions every 25 steps across models from 30M to 285M parameters, the researchers identified three phenomena: transient compression waves, in which stable-rank compression travels from early to late layers and then reverses; persistent spectral gradients that form a non-monotonic inverted-U shape across depth in deeper models; and a Q/K-V functional asymmetry in which value/output projections compress uniformly while query/key projections carry the depth-dependent dynamics. The dissociation between the transient compression waves and the persistent gradients indicates that the two are separate aspects of transformer learning dynamics.
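The stable rank tracked here has a standard definition in terms of a matrix's singular values: srank(W) = ||W||_F^2 / ||W||_2^2, the sum of squared singular values divided by the largest squared singular value. The sketch below is not the paper's code; it is a minimal illustration of how such a per-layer metric can be computed from a full SVD, and the helper name and the commented measurement loop are hypothetical.

    import numpy as np

    def stable_rank(weight: np.ndarray) -> float:
        """Stable rank of a 2-D weight matrix: srank(W) = ||W||_F^2 / ||W||_2^2."""
        sigma = np.linalg.svd(weight, compute_uv=False)   # singular values, largest first
        return float(np.sum(sigma ** 2) / sigma[0] ** 2)

    # Hypothetical measurement loop in the spirit of the study's setup:
    # every 25 steps, record the stable rank of each layer's projections.
    # `layers` and `get_projection` stand in for a real model's accessors.
    # for step in range(0, max_steps, 25):
    #     for depth, layer in enumerate(layers):
    #         for name in ("query", "key", "value", "output"):
    #             log(step, depth, name, stable_rank(get_projection(layer, name)))

Applied per projection type, the same metric separates the uniform compression reported for value/output projections from the depth-dependent behaviour of query/key projections.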

Key facts

  • First systematic study of weight matrix singular value spectra during transformer pretraining
  • Full singular value decompositions tracked at 25-step intervals
  • Models scaled from 30M to 285M parameters
  • Transient compression waves propagate from early to late layers
  • Compression gradient peaks early then reverses
  • Late layers eventually become more compressed than early layers
  • Power-law exponent α develops a permanent depth gradient (see the sketch after this list)
  • Spectral gradient forms an inverted-U across depth in deeper models, with the peak shifting toward earlier layers
  • Value/output projections compress uniformly
  • Query/key projections carry full depth-dependent dynamics
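
This summary does not describe how the power-law exponent α is fit to each layer's spectrum; the sketch below assumes one common proxy, a least-squares line fit to the singular values on log-log (rank vs. magnitude) axes, reporting the decay slope as α. The function name is illustrative.

    import numpy as np

    def spectral_alpha(weight: np.ndarray) -> float:
        """Decay exponent of the singular-value spectrum from a log-log linear fit."""
        sigma = np.linalg.svd(weight, compute_uv=False)   # descending singular values
        sigma = sigma[sigma > 0]                          # guard the logarithm
        ranks = np.arange(1, sigma.size + 1)
        slope, _intercept = np.polyfit(np.log(ranks), np.log(sigma), 1)
        return float(-slope)                              # report decay as a positive exponent

Computed per layer and per checkpoint, a permanent depth gradient in this quantity would correspond to the persistent spectral gradient described above.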

Entities

Institutions

  • arXiv

Sources

  • arXiv:2604.22778