Cascade Token Selection Speeds Up Transformer Attention
A new method called Cascade Token Selection reduces the computational cost of selecting representative tokens in transformer attention layers by exploiting the coherence of the representative set across depth. The approach, detailed in a paper on arXiv (2605.03110), builds on Activation Decorrelation Attention (ADA), which selects r representative tokens per layer using a Gram threshold but requires computing a costly full T×T Gram matrix at every layer. The cascade mechanism instead inherits the representative set from layer l to layer l+1, validates it with a (T−r)×r cross-Gram computation, and updates it with a small number of additions and removals, cutting selection cost from O(T²d) to O(Trd) per layer. Tests on GPT-2 124M, GPT-J 6B, and OPT 6.7B models on AMD MI300X hardware showed Gram-operation savings of 22% to 63%, with mean Jaccard overlap between consecutive layers' representative sets ranging from 0.83 to 0.94.
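To make the mechanism concrete, the following is a minimal NumPy sketch of one cascade step. It assumes a cosine-similarity threshold tau as the Gram criterion; the function name cascade_step and the exact add/remove rules are illustrative, not taken from the paper.

```python
import numpy as np

def cascade_step(X, rep_idx, tau=0.9):
    """One cascade step of representative-token selection.

    X: (T, d) activations at the current layer.
    rep_idx: non-empty representative set inherited from layer l.
    Returns the updated representative set for layer l+1.

    Sketch only: the cosine threshold tau and the exact add/remove
    rules are assumptions, not the paper's precise criterion.
    """
    T, _ = X.shape
    # Row-normalize so Gram entries are cosine similarities.
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-8)

    # Removals: greedily drop inherited representatives made redundant
    # by one already kept (r x r work, cheap next to the cross-Gram).
    keep = []
    for i in sorted(rep_idx):
        if not keep or float((Xn[keep] @ Xn[i]).max()) < tau:
            keep.append(i)
    rep = np.array(keep)
    rest = np.setdiff1d(np.arange(T), rep)

    # Validation: the (T - r) x r cross-Gram costs O(Trd) per layer,
    # versus O(T^2 d) for recomputing the full T x T Gram from scratch.
    cross = Xn[rest] @ Xn[rep].T

    # Additions: tokens the inherited set no longer covers at this layer.
    additions = rest[cross.max(axis=1) < tau]
    return set(rep.tolist()) | set(additions.tolist())
```

In a full model the set would be carried through the layer loop, presumably seeded by a full Gram-based selection at the first layer; the paper does not spell out the seeding step in the summary above.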
Key facts
- Cascade Token Selection reduces the cost of representative-token selection in transformer attention layers.
- Method exploits coherence of representative set across depth.
- Builds on Activation Decorrelation Attention (ADA) which selects r representative tokens per layer via Gram threshold.
- ADA requires T×T Gram matrix at every layer.
- Cascade inherits representative set from layer l to layer l+1.
- Validates the inherited set via a (T−r)×r cross-Gram computation.
- Updates the set with a small number of additions and removals.
- Selection cost drops from O(T²d) to O(Trd) per layer.
- Tested on GPT-2 124M, GPT-J 6B, OPT 6.7B models.
- Hardware used: AMD MI300X.
- Gram operation savings: 22% to 63%.
- Mean Jaccard overlap between consecutive layers' representative sets: 0.83 to 0.94 (measured as sketched below).
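The Jaccard figures quantify why inheritance pays off: consecutive layers keep most of the same representatives. A minimal sketch of how such overlap would be measured, where the per-layer index sets are made-up values:

```python
def jaccard(a, b):
    """Jaccard overlap |a & b| / |a | b| between two representative sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Illustrative per-layer representative sets; the paper's runs averaged
# 0.83 to 0.94 across consecutive layers.
reps_by_layer = [{0, 3, 7, 9}, {0, 3, 7, 12}, {0, 3, 9, 12}]
overlaps = [jaccard(reps_by_layer[l], reps_by_layer[l + 1])
            for l in range(len(reps_by_layer) - 1)]
print(overlaps)  # [0.6, 0.6] for these made-up sets
```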
Entities
Institutions
- arXiv