Attention Mechanism Decomposed into Routing and Filtering Components

ai-technology · 2026-05-20

A recent study posted to arXiv (2605.18826) decomposes the attention interaction matrix QK^T into two parts with distinct functions: a skew-symmetric part that routes information between positions and a symmetric part that filters for mutual relevance. Analyzing 1776 heads from five pretrained transformers, the researchers found that routing operates at low rank, below the capacity allocated by the weight kernels. They propose S-D attention, a diagnostic framework that disentangles routing from filtering with guaranteed stability and trains stably without layer normalization. When disentangled and unnormalized, routing self-organizes into a spectral cascade: its effective rank is 2 at the first layer and expands with depth, a pattern observed across six model scales from 7M to 355M parameters. The cascade indicates where attention can be simplified: linearizing the first seven layers of a 125M-parameter S-D attention model costs less than 5% in performance.
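The split itself is ordinary linear algebra: any square matrix is the sum of a symmetric and a skew-symmetric part. The sketch below illustrates that identity with NumPy on a random matrix. It is not the paper's code: the names `split_interaction`, `w_q`, and `w_k`, the toy dimensions, and the choice to apply the split to the weight product W_Q W_K^T are all assumptions made for illustration.

```python
import numpy as np


def split_interaction(m):
    """Split a square matrix into symmetric and skew-symmetric parts.

    Hypothetical helper for illustration; the function name and its use
    on a W_Q W_K^T product are assumptions, not taken from the paper.
    """
    sym = 0.5 * (m + m.T)    # sym[i, j] == sym[j, i]: mutual relevance
    skew = 0.5 * (m - m.T)   # skew[i, j] == -skew[j, i]: directional term
    return sym, skew


# Toy sizes and random weights stand in for a trained attention head.
rng = np.random.default_rng(0)
d_model, d_head = 16, 4
w_q = rng.standard_normal((d_model, d_head))
w_k = rng.standard_normal((d_model, d_head))

# Bilinear form that scores a query-side vector against a key-side vector.
m = w_q @ w_k.T

sym, skew = split_interaction(m)

# The two parts add back up to the original matrix exactly.
reconstruction_error = np.abs(sym + skew - m).max()

# A vector scored against itself gets nothing from the skew part,
# because x^T A x = 0 for every skew-symmetric A.
x = rng.standard_normal(d_model)
self_score_from_skew = x @ skew @ x
```

Because the skew-symmetric part flips sign when the two positions are swapped, it can only encode direction between them, while the symmetric part scores both orderings identically. That is the algebraic basis for reading the two parts as routing and filtering.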

Key facts

arXiv paper 2605.18826 decomposes attention into routing (skew-symmetric) and filtering (symmetric) components.
1776 attention heads across five pretrained transformers were analyzed.
Routing operates at low rank, below the capacity allocated by weight kernels.
S-D attention disentangles routing from filtering with guaranteed stability.
S-D attention trains stably without layer normalization.
Routing self-organizes into a spectral cascade when disentangled and unnormalized.
Effective rank is 2 at the first layer and expands with depth.
The cascade is observed across six model scales from 7M to 355M parameters.
Linearizing the first seven layers of a 125M-parameter S-D attention model costs less than 5% in performance.
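To make the "effective rank" figure concrete, here is a minimal sketch of one common definition: the exponential of the Shannon entropy of the normalized singular values. The summary does not say which definition the paper uses, so this choice, and the helper name `effective_rank`, are assumptions. The example relies on a standard fact: the nonzero singular values of a real skew-symmetric matrix come in equal pairs, so the simplest nonzero routing matrix has an effective rank of 2 under this definition.

```python
import numpy as np


def effective_rank(m):
    """Entropy-based effective rank: exp(H(p)), with p the normalized
    singular values. Assumed definition; the paper may use another one.
    """
    s = np.linalg.svd(m, compute_uv=False)
    total = s.sum()
    if total == 0.0:
        return 0.0           # an all-zero matrix has no rank to speak of
    p = s / total
    p = p[p > 0]             # drop exact zeros so log() stays finite
    return float(np.exp(-(p * np.log(p)).sum()))


# Simplest nonzero skew-symmetric matrix: one "rotation plane" spanned
# by two vectors u and v (random toy data, not values from the paper).
rng = np.random.default_rng(0)
u = rng.standard_normal(16)
v = rng.standard_normal(16)
plane = np.outer(u, v) - np.outer(v, u)

rank_of_plane = effective_rank(plane)          # two equal singular values
rank_of_identity = effective_rank(np.eye(4))   # four equal singular values
```

Under this definition a matrix with k equal nonzero singular values has an effective rank of exactly k, and unequal spectra give fractional values in between. That makes it a convenient way to track how a spectrum spreads out from layer to layer.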
