Asymmetric Weight Traces in Transformer Alignment
A recent study posted to arXiv (arXiv:2605.16600) reports that cross-entropy pretraining and preference alignment leave distinct geometric traces in transformer weights. The researchers introduce a relative-subspace-fraction probe that measures how far weight changes lie within two reference subspaces: the activation subspaces of the residual stream and the prediction subspace derived from the unembedding. The weight deltas from alignment concentrate in the read pathway (W_Q, W_K), where they track the principal directions of the attention-input activations, while in the write pathway (W_O, W_2) they remain nearly isotropic relative to the prediction subspace. The paper attributes this asymmetry to anisotropic gradient accumulation: the update to a weight matrix W is a sum of outer products δ_t a_t^T, so it inherits directional structure from whichever factor has the more concentrated covariance. In trained transformers that factor is the input activation a_t, whose covariance is spiked, so the update lines up with the input-side directions regardless of the training objective.
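The outer-product argument can be checked on a toy example. The sketch below is not the paper's code: the dimensions, the spiked-covariance model for a_t, the isotropic δ_t, and the function `relative_subspace_fraction` (one plausible reading of the probe, defined here as the fraction of squared Frobenius norm inside a k-dimensional subspace divided by the k/d expected for an isotropic matrix) are all assumptions made for illustration. A random subspace stands in for the prediction subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, k, steps = 128, 128, 8, 2000  # illustrative sizes, not from the paper

# Input activations a_t with a spiked covariance (assumed toy model):
# k directions carry variance 16, the remaining d_in - k carry variance 1.
rot, _ = np.linalg.qr(rng.standard_normal((d_in, d_in)))
scales = np.ones(d_in)
scales[:k] = 4.0
acts = (rng.standard_normal((steps, d_in)) * scales) @ rot.T  # rows are a_t

# Output-side signals delta_t, drawn isotropically (assumption).
deltas = rng.standard_normal((steps, d_out))  # rows are delta_t

# Accumulated update: sum_t delta_t a_t^T, shape (d_out, d_in).
delta_w = deltas.T @ acts


def relative_subspace_fraction(update, basis, side):
    """Hypothetical probe: energy fraction inside span(basis), over k/d.

    basis has orthonormal columns, shape (d, k). side="in" projects the
    input (column) space of the update, side="out" the output (row) space.
    A value near 1 means isotropic; well above 1 means concentrated.
    """
    d, k_sub = basis.shape
    proj = update @ basis if side == "in" else basis.T @ update
    fraction = np.sum(proj**2) / np.sum(update**2)
    return fraction / (k_sub / d)


u_in = rot[:, :k]  # top-k principal directions of the activations
u_out, _ = np.linalg.qr(rng.standard_normal((d_out, k)))  # stand-in subspace

r_in = relative_subspace_fraction(delta_w, u_in, "in")
r_out = relative_subspace_fraction(delta_w, u_out, "out")
print(f"input side: {r_in:.2f}, output side: {r_out:.2f}")
```

Under these assumptions the input-side value should come out well above 1 and the output-side value close to 1, mirroring the read/write asymmetry described above. Swapping which factor is spiked should flip the result.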
Key facts
- Paper: arXiv:2605.16600
- Cross-entropy pretraining and preference alignment leave distinct geometric traces
- Relative-subspace-fraction probe introduced
- Alignment deltas concentrate in read pathway (W_Q, W_K)
- Write pathway (W_O, W_2) remains near-isotropic
- Anisotropic gradient accumulation explains pattern
- Updates are sums of outer products δ_t a_t^T
- Input activation covariance is spiked in trained transformers
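A short derivation shows why a sum of outer products inherits structure from its factors. It is an idealized sketch, not taken from the paper: it assumes the pairs (δ_t, a_t) are i.i.d. across steps and that δ_t is zero-mean and independent of a_t, which real training gradients satisfy only approximately. The symbols Σ_a and Σ_δ (second-moment matrices of a_t and δ_t) and T (number of accumulated steps) are introduced here for illustration.

```latex
\Delta W = \sum_{t=1}^{T} \delta_t a_t^{\top}

% Cross terms (s \neq t) vanish because E[\delta_t a_t^{\top}] = 0.
\mathbb{E}\!\left[\Delta W^{\top} \Delta W\right]
  = \sum_{t=1}^{T} \mathbb{E}\!\left[\|\delta_t\|^{2}\, a_t a_t^{\top}\right]
  = T\, \mathbb{E}\|\delta\|^{2}\, \Sigma_a

\mathbb{E}\!\left[\Delta W\, \Delta W^{\top}\right]
  = \sum_{t=1}^{T} \mathbb{E}\!\left[\|a_t\|^{2}\, \delta_t \delta_t^{\top}\right]
  = T\, \mathbb{E}\|a\|^{2}\, \Sigma_\delta
```

On the input side, the update's second moment is proportional to Σ_a, so a spiked Σ_a yields a concentrated update. On the output side it is proportional to Σ_δ, so a near-isotropic Σ_δ yields a near-isotropic update.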
Entities
Institutions
- arXiv