ARTFEED — Contemporary Art Intelligence

SANTA: Stochastic Sparse Attention Speeds Up LLM Inference

ai-technology · 2026-05-06

Researchers propose Stochastic Additive No-mulT Attention (SANTA), a method to accelerate memory-bound autoregressive decoding in large language models. By sampling S indices from the post-softmax attention distribution and aggregating only those value rows, SANTA replaces the value-stage multiply-accumulates with gather-and-add while remaining an unbiased estimator of the exact attention output. Stratified sampling produces variance-reduced, GPU-friendly variants. On an NVIDIA RTX 6000 Ada, SANTA achieves a 1.5× decode-step attention kernel speedup over FlashInfer and FlashDecoding at 32k-token contexts while matching baseline accuracy. Bernoulli qKᵀ sampling is also introduced as a complementary technique that sparsifies the score stage via stochastic ternary queries.
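
A minimal NumPy sketch of the sampled value-stage estimate described above: draw S indices from the post-softmax distribution and average the gathered value rows, so the value stage reduces to gathers and adds. The function name santa_attention and the single-query layout are illustrative assumptions, not the paper's kernel.

    import numpy as np

    def santa_attention(q, K, V, S, rng=None):
        """Approximate softmax(q Kᵀ / √d) @ V by sampling S value rows.

        q: (d,) query; K, V: (n_k, d) key/value caches; S: number of samples.
        """
        rng = np.random.default_rng() if rng is None else rng
        d = q.shape[-1]
        scores = K @ q / np.sqrt(d)           # attention logits, shape (n_k,)
        p = np.exp(scores - scores.max())     # numerically stable softmax
        p /= p.sum()                          # post-softmax distribution
        # Sample i ~ p and average the gathered rows. Since E[V[i]] = Σᵢ pᵢ Vᵢ,
        # the sample mean is an unbiased estimate of the exact attention output.
        idx = rng.choice(len(p), size=S, p=p)
        return V[idx].mean(axis=0)            # gather-and-add; no per-row multiply

Because decoding is memory-bound, reading S ≪ n_k value rows instead of all n_k is where the claimed kernel speedup would come from; the score stage in this sketch is still computed in full.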

Key facts

  • SANTA sparsifies value-cache access by sampling S ≪ n_k indices from the post-softmax distribution.
  • Method replaces value-stage multiply-accumulates with gather-and-add.
  • Stratified sampling yields variance-reduced, GPU-friendly variants (sketched after this list).
  • 1.5× decode-step attention kernel speedup over FlashInfer and FlashDecoding on NVIDIA RTX 6000 Ada.
  • Matches baseline accuracy at 32k-token contexts.
  • Bernoulli qKᵀ sampling sparsifies the score stage with stochastic ternary queries (sketched after this list).
  • Paper appears on arXiv as 2605.01910.
  • Focus on memory-bound inference for long contexts.
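
A hedged sketch of how the stratified, variance-reduced variant could be realized: draw one uniform per equal-width stratum of [0, 1) and map it through the inverse CDF of the post-softmax distribution. Each index still has marginal distribution p, so the estimator stays unbiased, and the sorted indices make value-cache gathers more coalesced; the paper's actual kernel may differ.

    import numpy as np

    def stratified_sample_indices(p, S, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        u = (np.arange(S) + rng.random(S)) / S       # one uniform per stratum of [0, 1)
        cdf = np.cumsum(p)
        idx = np.searchsorted(cdf, u)                # inverse-CDF lookup, indices come sorted
        return np.minimum(idx, len(p) - 1)           # guard against float round-off

    def santa_attention_stratified(q, K, V, S, rng=None):
        d = q.shape[-1]
        scores = K @ q / np.sqrt(d)
        p = np.exp(scores - scores.max())
        p /= p.sum()
        idx = stratified_sample_indices(p, S, rng)
        return V[idx].mean(axis=0)                   # still an unbiased gather-and-add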
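
Bernoulli qKᵀ sampling is only described at a high level here; the sketch below assumes one plausible construction in which each query coordinate is stochastically rounded to a ternary value in {-c, 0, +c} with expectation equal to the original coordinate, so the score stage touches only the key columns where the ternary query is nonzero. The scale c and keep probability are illustrative assumptions, not the paper's exact recipe.

    import numpy as np

    def ternary_query(q, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        c = np.abs(q).max()                          # per-query scale (assumed)
        if c == 0.0:
            return np.zeros_like(q)                  # degenerate all-zero query
        keep = rng.random(q.shape) < np.abs(q) / c   # Bernoulli keep mask
        return np.where(keep, np.sign(q) * c, 0.0)   # E[ternary_query(q)] = q

    def sparse_scores(q, K, rng=None):
        t = ternary_query(q, rng)
        nz = np.nonzero(t)[0]                        # columns with nonzero ternary weight
        return K[:, nz] @ t[nz]                      # unbiased estimate of K @ q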

Entities

Institutions

  • arXiv

Sources