Token-Selective Attention Reduces Transformer Compute by 20%
Researchers have introduced Token-Selective Attention (TSA), a technique that lets transformer models skip computation for simpler tokens. TSA inserts a learned gate between transformer blocks: a two-layer MLP that maps each token's hidden state to a continuous halting probability, adding only 1.7% to the parameter count. Because the gate simply scales each token's residual update, the method is fully differentiable and requires no changes to the base architecture. Even with no explicit depth regularization (λ = 0), TSA learns to skip 20% of token-layer operations from the task-loss gradient alone. In character-level language modeling on Tiny-Shakespeare and enwik8, TSA cut token-layer operations (TLOps) by 14-23% with less than 0.5% quality loss, and at matched efficiency it achieved a 0.7% lower validation loss than early-exit baselines. The paper is available on arXiv (2605.05222).
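The paper's exact implementation is not reproduced here, but the mechanism described above can be sketched in a few lines of PyTorch: a two-layer MLP gate produces a per-token halting probability that scales the residual update of an otherwise unmodified block. The class names, hidden width, and the assumption that a block returns its input plus a residual update are illustrative choices, not details taken from the paper.

```python
import torch
import torch.nn as nn

class TokenGate(nn.Module):
    """Two-layer MLP mapping each token's hidden state to a halting
    probability in (0, 1). The hidden width is an arbitrary choice."""
    def __init__(self, d_model: int, d_hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> (batch, seq_len, 1)
        return torch.sigmoid(self.net(x))


class GatedBlock(nn.Module):
    """Wraps an unmodified transformer block and gates its residual
    update per token, leaving the base architecture untouched."""
    def __init__(self, block: nn.Module, d_model: int):
        super().__init__()
        self.block = block
        self.gate = TokenGate(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = self.gate(x)              # continuous halting probability per token
        update = self.block(x) - x    # residual update, assuming block(x) = x + f(x)
        return x + g * update         # g near 0 effectively skips the block for that token


# Illustrative usage with a stock PyTorch encoder layer.
layer = GatedBlock(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    d_model=256,
)
h = torch.randn(2, 128, 256)          # (batch, seq_len, d_model)
out = layer(h)                        # same shape as h; gradients flow through the gate
```

Since the gate output stays continuous during training, the whole model remains end-to-end differentiable; the compute saving would presumably come at inference, where tokens whose gate falls below a threshold can bypass the block entirely rather than merely being down-weighted.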
Key facts
- TSA adds a learned per-token gate on residual updates between transformer blocks.
- The gate is a two-layer MLP producing a continuous halting probability.
- Parameter overhead is only 1.7%.
- No changes to the base architecture are required.
- At λ=0 (no explicit depth penalty in the training objective; see the sketch after this list), TSA still skips 20% of token-layer operations.
- On Tiny-Shakespeare and enwik8, TSA saves 14-23% TLOps with <0.5% quality loss.
- At matched efficiency, TSA achieves 0.7% lower validation loss than early exit.
- Paper available on arXiv: 2605.05222.
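The λ above presumably weights an explicit penalty on per-token depth. A common way to write such an objective, given here as an assumed form rather than the paper's exact loss, is

$$
\mathcal{L} \;=\; \mathcal{L}_{\text{task}} \;+\; \lambda \, \frac{1}{T L} \sum_{t=1}^{T} \sum_{\ell=1}^{L} g_{t,\ell},
$$

where g_{t,ℓ} is the gate probability for token t at block ℓ. Setting λ = 0 removes the explicit pressure to halt early, which is the regime in which TSA is reported to still skip 20% of token-layer operations.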