SANTA: Stochastic Sparse Attention Speeds Up LLM Inference
Researchers propose Stochastic Additive No-mulT Attention (SANTA), a method to accelerate memory-bound autoregressive decoding in large language models. By sampling S indices from the post-softmax distribution and aggregating only those value rows, SANTA replaces value-stage multiply-accumulates with gather-and-add, yielding an unbiased estimator of the attention output. Stratified sampling produces variance-reduced, GPU-friendly variants. On an NVIDIA RTX 6000 Ada, SANTA achieves a 1.5× decode-step attention kernel speedup over FlashInfer and FlashDecoding at 32k-token contexts while matching baseline accuracy. Bernoulli qKᵀ sampling is also introduced as a complementary technique that sparsifies the score stage via stochastic ternary queries.
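A minimal sketch of the sampled value stage follows. It is illustrative only: the function name, shapes, NumPy formulation, and the stratified CDF inversion are assumptions for exposition, not the paper's GPU kernel.

```python
import numpy as np

def santa_decode_step(q, K, V, S, rng=np.random.default_rng(0)):
    """Single-head decode-step attention with a sampled value stage.

    q: (d,) query; K, V: (n_k, d) key/value caches; S: number of sampled rows.
    Returns an unbiased estimate of softmax(q @ K.T / sqrt(d)) @ V.
    """
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)           # score stage (kept dense in this sketch)
    p = np.exp(scores - scores.max())
    p /= p.sum()                          # post-softmax distribution over n_k keys

    # Stratified (systematic) sampling: one uniform per stratum of width 1/S,
    # inverted through the CDF. The estimator stays unbiased, variance drops
    # relative to S i.i.d. categorical draws, and the result is a simple gather.
    u = (np.arange(S) + rng.random(S)) / S
    idx = np.minimum(np.searchsorted(np.cumsum(p), u), len(p) - 1)

    # Value stage: gather-and-add only, no multiply-accumulate.
    # E[(1/S) * sum_s V[idx_s]] = sum_i p_i * V[i].
    return V[idx].sum(axis=0) / S
```

Only S value rows are read from the cache, so value-stage memory traffic scales with S rather than the context length n_k.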
Key facts
- SANTA sparsifies value-cache access by sampling S << n_k indices from the post-softmax distribution.
- Method replaces value-stage multiply-accumulates with gather-and-add.
- Stratified sampling yields variance-reduced, GPU-friendly variants.
- 1.5× decode-step attention kernel speedup over FlashInfer and FlashDecoding on NVIDIA RTX 6000 Ada.
- Matches baseline accuracy at 32k-token contexts.
- Bernoulli qKᵀ sampling sparsifies the score stage via stochastic ternary queries (see the sketch after this list).
- Paper appears on arXiv as 2605.01910.
- Focuses on memory-bound inference for long contexts.
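A hedged sketch of the Bernoulli qKᵀ idea is below. The per-query scaling, function name, and NumPy formulation are assumptions; a real kernel would replace the matrix product with signed adds over the kept key columns.

```python
import numpy as np

def ternary_query_scores(q, K, rng=np.random.default_rng(0)):
    """Unbiased stochastic estimate of the score stage K @ q via a ternary query.

    Each query entry is kept with probability |q_i| / scale and reduced to its
    sign, so the retained query lies in {-1, 0, +1} and the qK^T product needs
    no multiplies, only signed adds over a sparse subset of key columns.
    """
    scale = np.abs(q).max()                          # assumed per-query scale
    keep = rng.random(q.shape) < np.abs(q) / scale   # Bernoulli keep mask
    q_ternary = np.sign(q) * keep                    # stochastic ternary query
    # E[scale * q_ternary] = q, hence E[scale * (K @ q_ternary)] = K @ q.
    return scale * (K @ q_ternary)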
Entities
Institutions
- arXiv