Exact Linear Attention: A New Transformer Mechanism

other · 2026-05-20

A recent study introduces Exact Linear Attention (ELA), a method that achieves linear computational complexity for Transformer attention by decomposing kernel functions exactly, so no approximation error is introduced. The method targets gradient explosion and token attention dilution, two problems seen in earlier linear attention methods, by enforcing kernel constraints that ensure non-negativity, discriminability, and a clear geometric meaning. Three kernels are proposed: the Hadamard Exp Kernel, the Summation Squared Euclidean Distance Kernel, and the Subtraction Squared Euclidean Distance Kernel. The study also introduces a Hyper Link structure, which replaces traditional residual connections to mitigate gradient degradation, and a Memory Lobe module, which uses bidirectional linear attention to track transformation flow across layers.
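The idea of an exactly decomposable kernel can be illustrated with a minimal sketch. This summary does not give the paper's kernel definitions, so the example below does not reproduce any of the three named kernels. As an assumption, it uses the plain squared Euclidean distance ||q - k||^2 as a non-negative similarity, chosen only because it expands exactly into a finite inner product: ||q||^2 - 2 q·k + ||k||^2. The function names (feature_q, feature_k, linear_attention, quadratic_attention) are hypothetical. The sketch shows that summing over the keys once gives the same output as building every pairwise weight.

```python
import random


def feature_q(q):
    """Query-side map: phi(q) = [||q||^2, -2*q_1, ..., -2*q_d, 1]."""
    return [sum(x * x for x in q)] + [-2.0 * x for x in q] + [1.0]


def feature_k(k):
    """Key-side map: psi(k) = [1, k_1, ..., k_d, ||k||^2].

    By construction, dot(phi(q), psi(k)) equals ||q - k||^2 exactly.
    """
    return [1.0] + list(k) + [sum(x * x for x in k)]


def quadratic_attention(Q, K, V):
    """Reference version: computes every pairwise weight, O(n^2) in length."""
    d_v = len(V[0])
    out = []
    for q in Q:
        weights = [sum((a - b) ** 2 for a, b in zip(q, k)) for k in K]
        total = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, V)) / total
                    for c in range(d_v)])
    return out


def linear_attention(Q, K, V):
    """Same output from key-side summaries, O(n) in sequence length."""
    m = len(K[0]) + 2          # feature dimension: d + 2
    d_v = len(V[0])
    kv = [[0.0] * d_v for _ in range(m)]   # sum_j psi(k_j) * v_j^T
    z = [0.0] * m                           # sum_j psi(k_j)
    for k, v in zip(K, V):
        fk = feature_k(k)
        for r in range(m):
            z[r] += fk[r]
            for c in range(d_v):
                kv[r][c] += fk[r] * v[c]
    out = []
    for q in Q:
        fq = feature_q(q)
        denom = sum(a * b for a, b in zip(fq, z))   # row normaliser
        out.append([sum(fq[r] * kv[r][c] for r in range(m)) / denom
                    for c in range(d_v)])
    return out


if __name__ == "__main__":
    random.seed(0)
    n, d, d_v = 8, 4, 3   # illustrative sizes, not taken from the paper
    Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    V = [[random.gauss(0, 1) for _ in range(d_v)] for _ in range(n)]
    ref = quadratic_attention(Q, K, V)
    fast = linear_attention(Q, K, V)
    gap = max(abs(a - b) for ra, rb in zip(ref, fast) for a, b in zip(ra, rb))
    print("max abs difference:", gap)
```

The two functions agree up to floating-point rounding because the decomposition is an identity, not an approximation. Weighting tokens by raw distance is not a sensible attention rule in practice; it is used here only to make the exact factorisation visible, and the version shown is non-causal.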

Key facts

Exact Linear Attention achieves linear computational complexity for Transformer attention.
It uses exact decomposition of kernel functions without approximation error.
It addresses gradient explosion and token attention dilution seen in prior linear attention methods.
Kernel constraints ensure non-negativity, discriminability, and geometric interpretability.
Proposed kernels: Hadamard Exp Kernel, Summation Squared Euclidean Distance Kernel, Subtraction Squared Euclidean Distance Kernel.
A Hyper Link structure replaces traditional residual connections to mitigate gradient degradation.
A Memory Lobe module, based on bidirectional linear attention, captures transformation flow across layers.
The paper is published on arXiv with ID 2605.18848.

Sources

arXiv cs.AI — 2026-05-20