Token-Selective Attention Reduces Transformer Compute by 20%
Researchers have introduced Token-Selective Attention (TSA), a technique that lets transformer models skip computation for simpler tokens. TSA inserts a learned gate between transformer blocks: a two-layer MLP that maps each token's hidden state to a continuous halting probability, adding only 1.7% to the parameter count. Because the gate simply scales each token's residual update, the method is fully differentiable and requires no changes to the base architecture. Even with no explicit depth regularization (λ = 0), TSA learns to skip 20% of token-layer operations from the task-loss gradient alone. In character-level language modeling on Tiny-Shakespeare and enwik8, TSA cut token-layer operations (TLOps) by 14-23% with less than 0.5% quality loss, and at matched efficiency it achieved a 0.7% lower validation loss than early-exit baselines. The paper is available on arXiv (2605.05222).
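The paper's exact implementation is not reproduced here, but the mechanism described above can be sketched in a few lines of PyTorch: a two-layer MLP gate produces a per-token halting probability that scales the residual update of an otherwise unmodified block. The class names, hidden width, and the assumption that a block returns its input plus a residual update are illustrative choices, not details taken from the paper.

```python
import torch
import torch.nn as nn

class TokenGate(nn.Module):
    """Two-layer MLP mapping each token's hidden state to a halting
    probability in (0, 1). The hidden width is an arbitrary choice."""
    def __init__(self, d_model: int, d_hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> (batch, seq_len, 1)
        return torch.sigmoid(self.net(x))


class GatedBlock(nn.Module):
    """Wraps an unmodified transformer block and gates its residual
    update per token, leaving the base architecture untouched."""
    def __init__(self, block: nn.Module, d_model: int):
        super().__init__()
        self.block = block
        self.gate = TokenGate(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = self.gate(x)              # continuous halting probability per token
        update = self.block(x) - x    # residual update, assuming block(x) = x + f(x)
        return x + g * update         # g near 0 effectively skips the block for that token


# Illustrative usage with a stock PyTorch encoder layer.
layer = GatedBlock(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    d_model=256,
)
h = torch.randn(2, 128, 256)          # (batch, seq_len, d_model)
out = layer(h)                        # same shape as h; gradients flow through the gate
```

Since the gate output stays continuous during training, the whole model remains end-to-end differentiable; the compute saving would presumably come at inference, where tokens whose gate falls below a threshold can bypass the block entirely rather than merely being down-weighted.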
Key facts
- TSA adds a learned per-token gate on residual updates between transformer blocks.
- The gate is a two-layer MLP producing a continuous halting probability.
- Parameter overhead is only 1.7%.
- No changes to the base architecture are required.
- At λ=0 (no explicit depth penalty in the training objective; see the sketch after this list), TSA still skips 20% of token-layer operations.
- On Tiny-Shakespeare and enwik8, TSA saves 14-23% TLOps with <0.5% quality loss.
- At matched efficiency, TSA achieves 0.7% lower validation loss than early exit.
- Paper available on arXiv: 2605.05222.
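The λ above presumably weights an explicit penalty on per-token depth. A common way to write such an objective, given here as an assumed form rather than the paper's exact loss, is

$$
\mathcal{L} \;=\; \mathcal{L}_{\text{task}} \;+\; \lambda \, \frac{1}{T L} \sum_{t=1}^{T} \sum_{\ell=1}^{L} g_{t,\ell},
$$

where g_{t,ℓ} is the gate probability for token t at block ℓ. Setting λ = 0 removes the explicit pressure to halt early, which is the regime in which TSA is reported to still skip 20% of token-layer operations.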