ARTFEED — Contemporary Art Intelligence

Cascade Token Selection Speeds Up Transformer Attention

ai-technology · 2026-05-07

A new method called Cascade Token Selection reduces the computational cost of selecting representative tokens in transformer attention layers. The approach, detailed in a paper on arXiv (2605.03110), builds on Activation Decorrelation Attention (ADA), which selects r representative tokens per layer using a Gram threshold but requires a costly T×T Gram matrix at each layer. The cascade mechanism inherits the representative set from layer l to layer l+1, validates it via a (T−r)×r cross-Gram computation, and updates it with minimal additions and removals. This reduces selection cost from O(T²d) to O(Trd) per layer. Tests on GPT-2 124M, GPT-J 6B, and OPT 6.7B models using AMD MI300X hardware showed Gram operation savings of 22% to 63%, with mean Jaccard overlap between consecutive layers ranging from 0.83 to 0.94.
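The mechanism described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation: the exact ADA selection rule and the cascade's add/remove criteria are assumptions here, modeled as a greedy cosine-decorrelation threshold `tau`. The key point the sketch shows is the cost difference: the baseline touches a full T×T Gram matrix, while the cascade step only computes a (T−r)×r cross-Gram against the inherited representatives.

```python
import numpy as np

def normalize(X):
    # Row-normalize activations so Gram entries are cosine similarities.
    return X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)

def ada_select(X, tau):
    """Baseline ADA-style selection (illustrative): greedily keep tokens whose
    max |cosine| with already-selected representatives stays below tau.
    Needs the full T x T Gram matrix -> O(T^2 d) per layer."""
    Xn = normalize(X)
    G = np.abs(Xn @ Xn.T)                     # T x T Gram
    reps = [0]
    for t in range(1, X.shape[0]):
        if G[t, reps].max() < tau:
            reps.append(t)
    return reps

def cascade_update(X, prev_reps, tau):
    """Cascade step (sketch of the paper's idea): inherit the representative
    set from layer l, validate it at layer l+1 with only a (T - r) x r
    cross-Gram, then apply a small number of additions and removals.
    Dominant cost is O(T r d) instead of O(T^2 d)."""
    T = X.shape[0]
    Xn = normalize(X)
    reps = list(prev_reps)
    others = [t for t in range(T) if t not in set(reps)]
    # Cross-Gram between non-representatives and inherited representatives.
    C = np.abs(Xn[others] @ Xn[reps].T)       # (T - r) x r
    # Additions: tokens no longer covered by any inherited representative.
    additions = [others[i] for i in range(len(others)) if C[i].max() < tau]
    # Removals: inherited representatives that became redundant at this layer.
    # The r x r Gram below is cheap since r << T.
    R = np.abs(Xn[reps] @ Xn[reps].T)
    keep_idx = []
    for i in range(len(reps)):
        if not keep_idx or R[i, keep_idx].max() < tau:
            keep_idx.append(i)
    return sorted([reps[i] for i in keep_idx] + additions)
```

Because consecutive layers' representative sets overlap heavily (Jaccard 0.83–0.94 in the paper's measurements), the additions and removals per layer are few, and the cross-Gram validation dominates the cascade's cost.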

Key facts

  • Cascade Token Selection reduces cost of representative token selection in transformer attention layers.
  • Method exploits coherence of representative set across depth.
  • Builds on Activation Decorrelation Attention (ADA) which selects r representative tokens per layer via Gram threshold.
  • ADA requires T×T Gram matrix at every layer.
  • Cascade inherits representative set from layer l to layer l+1.
  • Validates set via (T−r)×r cross-Gram computation.
  • Updates set with small number of additions and removals.
  • Selection cost drops from O(T²d) to O(Trd) per layer.
  • Tested on GPT-2 124M, GPT-J 6B, OPT 6.7B models.
  • Hardware used: AMD MI300X.
  • Gram operation savings: 22% to 63%.
  • Mean Jaccard overlap between consecutive layers: 0.83 to 0.94.
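The Jaccard overlap figure in the last bullet is what justifies inheriting the set across depth: a value near 1 means consecutive layers select nearly the same tokens. For reference, the metric over two token-index sets is simply:

```python
def jaccard(a, b):
    """Jaccard overlap |A ∩ B| / |A ∪ B| between two token-index sets.
    Returns 1.0 for two empty sets by convention."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0
```

At an overlap of 0.83 to 0.94, only a small fraction of tokens enter or leave the representative set per layer, so the cascade's incremental update stays cheap.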

Entities

Institutions

  • arXiv

Sources