ARTFEED — Contemporary Art Intelligence

S2O: Early Stopping for Sparse Attention via Online Permutation

other · 2026-05-07

S2O (early stopping for sparse attention via online permutation) is a new approach to the quadratic cost of attention in sequence length during long-context inference. Existing block-granularity sparsification techniques reduce latency, but their coarse block structure imposes an intrinsic sparsity ceiling. Drawing on virtual-to-physical address mapping in memory systems, S2O reexamines and factorizes FlashAttention execution so that inference can load non-contiguous tokens rather than a contiguous span. It replaces explicit permutation with an online, index-guided discrete loading policy that, guided by fine-grained patterns in attention heatmaps, concentrates computation on a small number of high-priority blocks with minimal preprocessing and index-remapping overhead. The paper is available on arXiv under identifier 2602.22575.
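To make the loading idea concrete, here is a minimal sketch of index-guided gathering for a single query. This is an illustration of the general gather pattern, not the paper's kernel: the function names, the hand-picked index set, and the toy data are all assumptions, and the real method selects blocks online from attention heatmaps rather than by hand.

```python
import math

def attention(q, K, V):
    # Standard scaled dot-product attention for one query vector q,
    # key rows K, and value rows V (all plain Python lists).
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    w = [x / z for x in w]
    return [sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))]

def sparse_attention_gather(q, K, V, keep):
    # Index-guided discrete loading: rather than physically permuting
    # K/V into a contiguous sparse layout ahead of time, gather the
    # selected non-contiguous rows on the fly through an index map --
    # analogous to virtual-to-physical address translation.
    idx = list(keep)
    return attention(q, [K[i] for i in idx], [V[i] for i in idx])

# Toy data; the "high-priority" indices are chosen by hand here.
q = [1.0, 0.0]
K = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0], [-2.0, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [9.0, 9.0]]
out = sparse_attention_gather(q, K, V, [0, 2])
```

Gathering all indices in order reduces to full attention, which is the sanity check that online index remapping changes where tokens are read from, not what is computed over the selected set.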

Key facts

  • S2O stands for early stopping for sparse attention via online permutation.
  • Attention scales quadratically with sequence length, limiting long-context inference.
  • Existing block-granularity sparsification has an intrinsic sparsity ceiling.
  • S2O is inspired by virtual-to-physical address mapping in memory systems.
  • S2O factorizes FlashAttention execution.
  • S2O loads non-contiguous tokens instead of a contiguous span.
  • The method uses an online, index-guided, discrete loading policy.
  • The paper is on arXiv with ID 2602.22575.
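The sparsity ceiling in the facts above can be illustrated with a toy worst case (all numbers hypothetical): if the tokens a query actually needs are scattered one per block, block-granularity selection must load every block that contains one of them, so it saves nothing, whereas token-granular index-guided loading fetches only the needed tokens.

```python
def blocks_loaded(needed, block_size):
    # Set of block ids touched by the needed token indices: block
    # selection cannot load less than a whole block per touched block.
    return {i // block_size for i in needed}

n, block = 4096, 64
# 64 needed tokens, scattered one per block (adversarial layout)
needed = list(range(0, n, block))
tokens_block_granular = len(blocks_loaded(needed, block)) * block  # loads all 4096
tokens_token_granular = len(needed)                                # loads only 64
```

Here block-granular loading touches all 64 blocks and so reads the full 4096-token sequence, while token-granular loading reads 64 tokens: the coarse structure, not the true sparsity, sets the floor.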

Entities

Institutions

  • arXiv

Sources