Attention Mechanism Decomposed into Routing and Filtering Components
A recent study on arXiv (2605.18826) decomposes the attention interaction matrix QK^T into two components with distinct roles: a skew-symmetric part that routes information between positions and a symmetric part that filters for mutual relevance. Analyzing 1776 heads from five pretrained transformers, the researchers found that routing operates at low rank, below the capacity allocated by the weight kernels. They propose S-D attention, a diagnostic framework that disentangles routing from filtering, comes with a stability guarantee, and trains stably without layer normalization. When routing is disentangled and unnormalized, it self-organizes into a spectral cascade: the effective rank is 2 at the first layer and expands with depth, a pattern observed across six model scales from 7M to 355M parameters. The cascade indicates where attention can be simplified: linearizing the first seven layers of a 125M-parameter S-D attention model costs less than 5% in performance.
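The split described above rests on a standard linear-algebra identity: any square matrix is the sum of a symmetric and a skew-symmetric matrix. The NumPy sketch below illustrates only that identity. It assumes the interaction matrix is formed from per-head query and key projections, and the names `W_q`, `W_k`, and `split_interaction` are hypothetical, not taken from the paper.

```python
import numpy as np

def split_interaction(M):
    """Split a square matrix into symmetric and skew-symmetric parts."""
    sym = 0.5 * (M + M.T)    # unchanged when the two positions are swapped
    skew = 0.5 * (M - M.T)   # flips sign when the two positions are swapped
    return sym, skew

# Illustrative random projections (assumption: one head, d_model=16, d_head=4).
rng = np.random.default_rng(0)
d_model, d_head = 16, 4
W_q = rng.standard_normal((d_model, d_head))
W_k = rng.standard_normal((d_model, d_head))

M = W_q @ W_k.T              # square interaction matrix, shape (d_model, d_model)
sym, skew = split_interaction(M)

x = rng.standard_normal(d_model)
y = rng.standard_normal(d_model)
forward = x @ skew @ y       # skew contribution to the score from x to y
backward = y @ skew @ x      # equal magnitude, opposite sign
```

Because the skew-symmetric part changes sign when query and key swap, it can only encode directed relationships, while the symmetric part scores both directions identically. That matches the routing and filtering roles the summary assigns to the two parts.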
Key facts
- arXiv paper 2605.18826 decomposes attention into routing (skew-symmetric) and filtering (symmetric) components.
- 1776 attention heads across five pretrained transformers were analyzed.
- Routing operates at low rank, below the capacity allocated by weight kernels.
- S-D attention disentangles routing from filtering with guaranteed stability.
- S-D attention trains stably without layer normalization.
- Routing self-organizes into a spectral cascade when disentangled and unnormalized.
- Effective rank is 2 at the first layer and expands with depth.
- Cascade observed across six model scales from 7M to 355M parameters.
- Linearizing the first seven layers of a 125M-parameter S-D attention model costs less than 5% in performance.
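The effective-rank figures above depend on a definition that this summary does not give. The sketch below assumes one common choice, the exponential of the Shannon entropy of the normalized singular values; the paper may use a different measure. The function name `effective_rank` and the test matrix `plane` are hypothetical.

```python
import numpy as np

def effective_rank(M, eps=1e-12):
    """Entropy-based effective rank (assumed definition, not from the paper)."""
    s = np.linalg.svd(M, compute_uv=False)
    total = s.sum()
    if total <= eps:
        return 0.0                        # zero matrix: no usable directions
    p = s / total                         # singular values as a distribution
    p = p[p > eps]                        # drop numerically zero entries
    return float(np.exp(-(p * np.log(p)).sum()))

# A skew-symmetric matrix built from a single pair of vectors.
rng = np.random.default_rng(0)
u = rng.standard_normal(8)
v = rng.standard_normal(8)
plane = np.outer(u, v) - np.outer(v, u)

rank_plane = effective_rank(plane)        # one plane: two equal singular values
rank_identity = effective_rank(np.eye(4)) # four equal singular values
```

The nonzero singular values of a real skew-symmetric matrix come in equal pairs, so 2 is the smallest nonzero rank such a matrix can have. Under this assumed definition, a routing matrix confined to a single plane scores an effective rank of 2.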
Entities
Institutions
- arXiv