Asymmetric Weight Traces in Transformer Alignment

ai-technology · 2026-05-20

A recent study posted on arXiv (2605.16600) reports that cross-entropy pretraining and preference alignment leave distinct geometric traces in transformer weights. The researchers introduce a relative-subspace-fraction probe that measures how weight changes line up with two reference subspaces: the activation subspaces of the residual stream and the prediction subspace derived from the unembedding. Alignment updates concentrate in the read pathway (W_Q, W_K), where they follow the principal directions of the attention-input activations, but stay nearly isotropic in the write pathway (W_O, W_2) relative to the prediction subspace. The study attributes this asymmetry to anisotropic gradient accumulation: each update to a matrix W is a sum of outer products δ_t a_t^T, so the update inherits directional structure from whichever side has concentrated covariance. In trained transformers that side is the input activation a_t, whose covariance is spiked, so the resulting directional alignment does not depend on the training objective.
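
The accumulation argument can be made concrete with a short calculation. This is only a sketch under simplifying assumptions that are not taken from the paper: δ_t and a_t are zero-mean, independent of each other, and independent across steps t, with covariances Σ_δ and Σ_a (both symbols are introduced here for illustration).

```latex
\Delta W = \sum_{t=1}^{T} \delta_t a_t^{\top},
\qquad
\mathbb{E}\!\left[\Delta W^{\top} \Delta W\right]
  = T \,\operatorname{tr}(\Sigma_\delta)\, \Sigma_a,
\qquad
\mathbb{E}\!\left[\Delta W \,\Delta W^{\top}\right]
  = T \,\operatorname{tr}(\Sigma_a)\, \Sigma_\delta .
```

Under these assumptions the cross terms with s ≠ t vanish in expectation, so the input-side (right) structure of ΔW follows the spectrum of Σ_a and the output-side (left) structure follows that of Σ_δ. If Σ_a is spiked while Σ_δ is close to a multiple of the identity, the update is concentrated on the input side and near-isotropic on the output side, which matches the read/write asymmetry described above.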

Key facts

Paper arXiv:2605.16600
Cross-entropy pretraining and preference alignment leave distinct geometric traces
Relative-subspace-fraction probe introduced
Alignment deltas concentrate in read pathway (W_Q, W_K)
Write pathway (W_O, W_2) remains near-isotropic
Anisotropic gradient accumulation explains pattern
Updates are sums of outer products δ_t a_t^T
Input activation covariance is spiked in trained transformers
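
The outer-product mechanism in the list above can be checked on synthetic data. The following is a minimal NumPy sketch, not the paper's probe: the function names, the synthetic Gaussian data, and the definition of the score (energy fraction of ΔW inside a top-k subspace, divided by the isotropic baseline k/d) are all assumptions made for illustration.

```python
import numpy as np


def top_k_basis(samples, k):
    """Return a (d x k) orthonormal basis for the top-k principal
    directions of `samples`, which has shape (steps x d)."""
    cov = samples.T @ samples / samples.shape[0]
    _, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    return eigvecs[:, -k:]


def relative_subspace_fraction(delta_w, basis, side):
    """Hypothetical probe: fraction of the squared Frobenius norm of
    delta_w lying in the subspace, divided by the isotropic baseline k/d.
    A value near 1 means near-isotropic; larger means concentrated."""
    d, k = basis.shape
    if side == "input":
        projected = delta_w @ basis      # project the rows (input side)
    else:
        projected = basis.T @ delta_w    # project the columns (output side)
    fraction = np.sum(projected ** 2) / np.sum(delta_w ** 2)
    return fraction / (k / d)


def simulate(d_in=64, d_out=64, steps=2000, k=4, spike=10.0, seed=0):
    """Accumulate sum_t delta_t a_t^T with spiked inputs and isotropic
    output-side signals, then score both sides of the update."""
    rng = np.random.default_rng(seed)
    # Assumed toy model: k input directions have std `spike`, the rest std 1.
    scales = np.ones(d_in)
    scales[:k] = spike
    a = rng.standard_normal((steps, d_in)) * scales
    # Assumed isotropic output-side signal delta_t.
    delta = rng.standard_normal((steps, d_out))
    delta_w = delta.T @ a  # equals sum_t delta_t a_t^T, shape (d_out, d_in)
    read = relative_subspace_fraction(delta_w, top_k_basis(a, k), "input")
    write = relative_subspace_fraction(delta_w, top_k_basis(delta, k), "output")
    return read, write


if __name__ == "__main__":
    read_score, write_score = simulate()
    print(f"input-side score:  {read_score:.2f}")
    print(f"output-side score: {write_score:.2f}")
```

In this toy setup the input-side score should come out well above 1 and the output-side score close to 1, because only the input activations carry a spiked covariance. Real gradients are neither Gaussian nor independent across steps, so the sketch illustrates the mechanism rather than reproducing the reported measurements.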
