ARTFEED — Contemporary Art Intelligence

Attention Mechanism Decomposed into Routing and Filtering Components

ai-technology · 2026-05-20

A recent study published on arXiv (2605.18826) decomposes the attention interaction matrix QK^T into two parts with distinct functions: a skew-symmetric part that routes information between positions and a symmetric part that filters for mutual relevance. Examining 1776 heads from five pretrained transformers, the researchers found that routing operates at low rank, below the capacity allocated by the weight kernels. They propose S-D attention, a diagnostic framework that disentangles routing from filtering with guaranteed stability and trains stably without layer normalization. When disentangled and unnormalized, routing self-organizes into a spectral cascade: its effective rank is 2 at the first layer and expands with depth, a pattern observed across six model scales from 7M to 355M parameters. The cascade indicates where attention can be simplified: linearizing the first seven layers of a 125M-parameter S-D attention model costs less than 5% in performance.
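The split described above is the standard decomposition of a square matrix into symmetric and skew-symmetric halves. The NumPy sketch below is illustrative only: it assumes the decomposition is applied to a per-head bilinear form M = W_q W_k^T built from random weights, and the names `split_interaction`, `W_q`, and `W_k` are hypothetical rather than taken from the paper.

```python
import numpy as np

def split_interaction(W_q, W_k):
    """Split M = W_q @ W_k.T into symmetric and skew-symmetric parts."""
    M = W_q @ W_k.T          # (d_model, d_model) form behind the scores QK^T
    S = 0.5 * (M + M.T)      # symmetric part: S == S.T
    A = 0.5 * (M - M.T)      # skew-symmetric part: A == -A.T
    return M, S, A

# Random stand-in weights (an assumption; not weights from any real model).
rng = np.random.default_rng(0)
d_model, d_head = 16, 4
W_q = rng.standard_normal((d_model, d_head))
W_k = rng.standard_normal((d_model, d_head))
M, S, A = split_interaction(W_q, W_k)

# Two token embeddings x and y.
x = rng.standard_normal(d_model)
y = rng.standard_normal(d_model)

sym_xy, sym_yx = x @ S @ y, y @ S @ x      # equal in both directions
skew_xy, skew_yx = x @ A @ y, y @ A @ x    # equal magnitude, opposite sign
```

The symmetric term gives the same score whichever token is the query, which matches the article's "mutual relevance" reading. The skew-symmetric term flips sign when query and key swap roles, so it is the only part that can encode a direction between positions.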

Key facts

  • arXiv paper 2605.18826 decomposes attention into routing (skew-symmetric) and filtering (symmetric) components.
  • 1776 attention heads across five pretrained transformers were analyzed.
  • Routing operates at low rank, below the capacity allocated by weight kernels.
  • S-D attention disentangles routing from filtering with guaranteed stability.
  • S-D attention trains stably without layer normalization.
  • Routing self-organizes into a spectral cascade when disentangled and unnormalized.
  • Effective rank is 2 at the first layer and expands with depth.
  • Cascade observed across six model scales from 7M to 355M parameters.
  • Linearizing the first seven layers of a 125M-parameter S-D attention model costs less than 5% in performance.
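For readers unfamiliar with effective rank, the sketch below shows one common way to measure it: the exponential of the entropy of the normalized singular values. The article does not say which definition the paper uses, so this choice is an assumption, and `effective_rank` is a hypothetical helper. One fact that holds regardless of definition is that a real skew-symmetric matrix always has even rank, so 2 is the smallest nonzero rank a routing component can have.

```python
import numpy as np

def effective_rank(matrix, eps=1e-12):
    """Entropy-based effective rank: exp(H(p)), p = normalized singular values.

    Assumed definition for illustration; the paper may use a different one.
    """
    s = np.linalg.svd(matrix, compute_uv=False)
    total = s.sum()
    if total <= eps:
        return 0.0
    p = s / total
    p = p[p > eps]                       # drop numerically zero modes
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
n = 8

# Rank-2 skew-symmetric matrix: one rotation-like plane spanned by u and v.
u = rng.standard_normal(n)
v = rng.standard_normal(n)
A_low = np.outer(u, v) - np.outer(v, u)

# Generic skew-symmetric matrix built from random noise.
G = rng.standard_normal((n, n))
A_full = 0.5 * (G - G.T)

r_low = effective_rank(A_low)    # close to 2.0: two equal singular values
r_full = effective_rank(A_full)  # larger than 2, at most n
```

Under this definition, a routing matrix with effective rank 2 behaves like a single antisymmetric plane, while larger values indicate more such planes contributing.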

Entities

Institutions

  • arXiv
