Cascade Token Selection Speeds Up Transformer Attention
A new method called Cascade Token Selection reduces the computational cost of selecting representative tokens in transformer attention layers by exploiting the coherence of the representative set across depth. The approach, detailed in a paper on arXiv (2605.03110), builds on Activation Decorrelation Attention (ADA), which selects r representative tokens per layer using a Gram threshold but requires computing a costly full T×T Gram matrix at every layer. The cascade mechanism instead inherits the representative set from layer l to layer l+1, validates it with a (T−r)×r cross-Gram computation, and updates it with a small number of additions and removals, cutting selection cost from O(T²d) to O(Trd) per layer. Tests on GPT-2 124M, GPT-J 6B, and OPT 6.7B models on AMD MI300X hardware showed Gram-operation savings of 22% to 63%, with mean Jaccard overlap between consecutive layers' representative sets ranging from 0.83 to 0.94.
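To make the mechanism concrete, the following is a minimal NumPy sketch of one cascade step. It assumes a cosine-similarity threshold tau as the Gram criterion; the function name cascade_step and the exact add/remove rules are illustrative, not taken from the paper.

```python
import numpy as np

def cascade_step(X, rep_idx, tau=0.9):
    """One cascade step of representative-token selection.

    X: (T, d) activations at the current layer.
    rep_idx: non-empty representative set inherited from layer l.
    Returns the updated representative set for layer l+1.

    Sketch only: the cosine threshold tau and the exact add/remove
    rules are assumptions, not the paper's precise criterion.
    """
    T, _ = X.shape
    # Row-normalize so Gram entries are cosine similarities.
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-8)

    # Removals: greedily drop inherited representatives made redundant
    # by one already kept (r x r work, cheap next to the cross-Gram).
    keep = []
    for i in sorted(rep_idx):
        if not keep or float((Xn[keep] @ Xn[i]).max()) < tau:
            keep.append(i)
    rep = np.array(keep)
    rest = np.setdiff1d(np.arange(T), rep)

    # Validation: the (T - r) x r cross-Gram costs O(Trd) per layer,
    # versus O(T^2 d) for recomputing the full T x T Gram from scratch.
    cross = Xn[rest] @ Xn[rep].T

    # Additions: tokens the inherited set no longer covers at this layer.
    additions = rest[cross.max(axis=1) < tau]
    return set(rep.tolist()) | set(additions.tolist())
```

In a full model the set would be carried through the layer loop, presumably seeded by a full Gram-based selection at the first layer; the paper does not spell out the seeding step in the summary above.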
Key facts
- Cascade Token Selection reduces the cost of representative-token selection in transformer attention layers.
- Method exploits coherence of representative set across depth.
- Builds on Activation Decorrelation Attention (ADA) which selects r representative tokens per layer via Gram threshold.
- ADA requires T×T Gram matrix at every layer.
- Cascade inherits representative set from layer l to layer l+1.
- Validates the inherited set via a (T−r)×r cross-Gram computation.
- Updates the set with a small number of additions and removals.
- Selection cost drops from O(T²d) to O(Trd) per layer.
- Tested on GPT-2 124M, GPT-J 6B, OPT 6.7B models.
- Hardware used: AMD MI300X.
- Gram operation savings: 22% to 63%.
- Mean Jaccard overlap between consecutive layers' representative sets: 0.83 to 0.94 (measured as sketched below).
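The Jaccard figures quantify why inheritance pays off: consecutive layers keep most of the same representatives. A minimal sketch of how such overlap would be measured, where the per-layer index sets are made-up values:

```python
def jaccard(a, b):
    """Jaccard overlap |a & b| / |a | b| between two representative sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Illustrative per-layer representative sets; the paper's runs averaged
# 0.83 to 0.94 across consecutive layers.
reps_by_layer = [{0, 3, 7, 9}, {0, 3, 7, 12}, {0, 3, 9, 12}]
overlaps = [jaccard(reps_by_layer[l], reps_by_layer[l + 1])
            for l in range(len(reps_by_layer) - 1)]
print(overlaps)  # [0.6, 0.6] for these made-up sets
```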
Entities
Institutions
- arXiv