S2O: Early Stopping for Sparse Attention via Online Permutation
S2O (early stopping for sparse attention via online permutation) is a new approach to the quadratic scaling of attention with sequence length during long-context inference. Existing block-granularity sparsification techniques reduce latency but hit an intrinsic sparsity ceiling imposed by their coarse block structure. Drawing on fine-grained patterns in attention heatmaps, and inspired by virtual-to-physical address mapping in memory systems, S2O re-examines and factorizes FlashAttention execution so that inference can load non-contiguous tokens rather than a contiguous span. It turns explicit permutation into an online, index-guided discrete loading policy that concentrates computation on a small number of high-priority blocks, with minimal preprocessing and index-remapping overhead. The paper is available on arXiv under identifier 2602.22575.
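The core idea of index-guided discrete loading is easier to see in code. The following is a minimal sketch, not the paper's kernel: it assumes each query block arrives with a precomputed list of selected token indices, and all names (sparse_attention_via_indices, token_idx) and shapes are hypothetical. The point it illustrates is the "discrete load": only the indexed, non-contiguous tokens are gathered from the KV cache before the attention computation touches them.

```python
# Hypothetical sketch of index-guided discrete loading for sparse attention.
# Instead of attending over a contiguous span, a query block attends only to
# the tokens named by an index list (loosely analogous to a virtual-to-physical
# address mapping). Names and shapes are illustrative, not the paper's API.
import torch

def sparse_attention_via_indices(q, k, v, token_idx):
    """q: (Bq, d) query block; k, v: (N, d) full KV caches;
    token_idx: (S,) int64 indices of the selected, non-contiguous tokens."""
    # "Discrete load": gather only the selected tokens into a compact buffer,
    # so the skipped regions of the KV cache are never read.
    k_sel = k.index_select(0, token_idx)           # (S, d)
    v_sel = v.index_select(0, token_idx)           # (S, d)
    scores = q @ k_sel.T / (q.shape[-1] ** 0.5)    # (Bq, S)
    probs = torch.softmax(scores, dim=-1)
    return probs @ v_sel                           # (Bq, d)

# Toy usage: 4 queries attend to 6 scattered tokens out of a 1024-token cache.
N, d = 1024, 64
q = torch.randn(4, d)
k, v = torch.randn(N, d), torch.randn(N, d)
token_idx = torch.tensor([3, 17, 402, 511, 770, 1001])
out = sparse_attention_via_indices(q, k, v, token_idx)
print(out.shape)  # torch.Size([4, 64])
```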
Key facts
- S2O stands for early stopping for sparse attention via online permutation.
- Attention scales quadratically with sequence length, limiting long-context inference.
- Existing block-granularity sparsification has an intrinsic sparsity ceiling.
- S2O is inspired by virtual-to-physical address mapping in memory systems.
- S2O factorizes FlashAttention execution.
- S2O loads non-contiguous tokens instead of a contiguous span.
- The method uses an online, index-guided, discrete loading policy (illustrated in the sketch after this list).
- The paper is on arXiv with ID 2602.22575.
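To connect the discrete loading policy with the factorized FlashAttention execution mentioned above, the sketch below shows a FlashAttention-style online softmax that streams over index-selected blocks instead of a contiguous range, so higher sparsity simply means fewer, scattered blocks. It is an assumption-laden illustration: the function name, the per-block index lists, and the single-query setting are not the paper's interface.

```python
# Minimal sketch (assumed interface, not the paper's kernel): FlashAttention-style
# online softmax accumulation over index-selected, non-contiguous KV blocks.
import torch

def online_softmax_attention(q, k, v, block_index_lists):
    """q: (d,); k, v: (N, d); block_index_lists: iterable of 1-D index tensors,
    each naming the (possibly non-contiguous) tokens of one selected block."""
    d = q.shape[-1]
    m = torch.tensor(float("-inf"))   # running max of the logits
    l = torch.tensor(0.0)             # running softmax normalizer
    acc = torch.zeros(d)              # running unnormalized output
    for idx in block_index_lists:
        k_blk, v_blk = k[idx], v[idx]                 # discrete load of one block
        s = (k_blk @ q) / d ** 0.5                    # block logits, shape (blk,)
        m_new = torch.maximum(m, s.max())
        scale = torch.exp(m - m_new)                  # rescale previous partials
        p = torch.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ v_blk
        m = m_new
    return acc / l

# Toy check against dense attention restricted to the same selected tokens.
N, d = 256, 32
q, k, v = torch.randn(d), torch.randn(N, d), torch.randn(N, d)
blocks = [torch.tensor([1, 7, 40]), torch.tensor([90, 91, 200, 231])]
ref_idx = torch.cat(blocks)
ref = torch.softmax((k[ref_idx] @ q) / d ** 0.5, dim=-1) @ v[ref_idx]
print(torch.allclose(online_softmax_attention(q, k, v, blocks), ref, atol=1e-5))
```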
Entities
Institutions
- arXiv