SANTA: Stochastic Sparse Attention Speeds Up LLM Inference
Researchers propose Stochastic Additive No-mulT Attention (SANTA), a method to accelerate memory-bound autoregressive decoding in large language models. By sampling S indices from the post-softmax distribution and aggregating only those value rows, SANTA replaces value-stage multiply-accumulates with gather-and-add, yielding an unbiased estimator of the attention output. Stratified sampling produces variance-reduced, GPU-friendly variants. On an NVIDIA RTX 6000 Ada, SANTA achieves a 1.5× decode-step attention kernel speedup over FlashInfer and FlashDecoding at 32k-token contexts while matching baseline accuracy. Bernoulli qKᵀ sampling is also introduced as a complementary technique that sparsifies the score stage via stochastic ternary queries.
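A minimal sketch of the sampled value stage follows. It is illustrative only: the function name, shapes, NumPy formulation, and the stratified CDF inversion are assumptions for exposition, not the paper's GPU kernel.

```python
import numpy as np

def santa_decode_step(q, K, V, S, rng=np.random.default_rng(0)):
    """Single-head decode-step attention with a sampled value stage.

    q: (d,) query; K, V: (n_k, d) key/value caches; S: number of sampled rows.
    Returns an unbiased estimate of softmax(q @ K.T / sqrt(d)) @ V.
    """
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)           # score stage (kept dense in this sketch)
    p = np.exp(scores - scores.max())
    p /= p.sum()                          # post-softmax distribution over n_k keys

    # Stratified (systematic) sampling: one uniform per stratum of width 1/S,
    # inverted through the CDF. The estimator stays unbiased, variance drops
    # relative to S i.i.d. categorical draws, and the result is a simple gather.
    u = (np.arange(S) + rng.random(S)) / S
    idx = np.minimum(np.searchsorted(np.cumsum(p), u), len(p) - 1)

    # Value stage: gather-and-add only, no multiply-accumulate.
    # E[(1/S) * sum_s V[idx_s]] = sum_i p_i * V[i].
    return V[idx].sum(axis=0) / S
```

Only S value rows are read from the cache, so value-stage memory traffic scales with S rather than the context length n_k.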
Key facts
- SANTA sparsifies value-cache access by sampling S << n_k indices from the post-softmax distribution.
- Method replaces value-stage multiply-accumulates with gather-and-add.
- Stratified sampling yields variance-reduced, GPU-friendly variants.
- 1.5× decode-step attention kernel speedup over FlashInfer and FlashDecoding on NVIDIA RTX 6000 Ada.
- Matches baseline accuracy at 32k-token contexts.
- Bernoulli qKᵀ sampling sparsifies the score stage via stochastic ternary queries (see the sketch after this list).
- Paper appears on arXiv as 2605.01910.
- Focuses on memory-bound inference for long contexts.
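A hedged sketch of the Bernoulli qKᵀ idea is below. The per-query scaling, function name, and NumPy formulation are assumptions; a real kernel would replace the matrix product with signed adds over the kept key columns.

```python
import numpy as np

def ternary_query_scores(q, K, rng=np.random.default_rng(0)):
    """Unbiased stochastic estimate of the score stage K @ q via a ternary query.

    Each query entry is kept with probability |q_i| / scale and reduced to its
    sign, so the retained query lies in {-1, 0, +1} and the qK^T product needs
    no multiplies, only signed adds over a sparse subset of key columns.
    """
    scale = np.abs(q).max()                          # assumed per-query scale
    keep = rng.random(q.shape) < np.abs(q) / scale   # Bernoulli keep mask
    q_ternary = np.sign(q) * keep                    # stochastic ternary query
    # E[scale * q_ternary] = q, hence E[scale * (K @ q_ternary)] = K @ q.
    return scale * (K @ q_ternary)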
Entities
Institutions
- arXiv