S2O: Early Stopping for Sparse Attention via Online Permutation
S2O (early stopping for sparse attention via online permutation) is a new approach to the quadratic scaling of attention with sequence length during long-context inference. Existing block-granularity sparsification techniques reduce latency but hit an intrinsic sparsity ceiling imposed by their coarse block structure. Drawing on fine-grained patterns in attention heatmaps, and inspired by virtual-to-physical address mapping in memory systems, S2O re-examines and factorizes FlashAttention execution so that inference can load non-contiguous tokens rather than a contiguous span. It turns explicit permutation into an online, index-guided discrete loading policy that concentrates computation on a small number of high-priority blocks, with minimal preprocessing and index-remapping overhead. The paper is available on arXiv under identifier 2602.22575.
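The core idea of index-guided discrete loading is easier to see in code. The following is a minimal sketch, not the paper's kernel: it assumes each query block arrives with a precomputed list of selected token indices, and all names (sparse_attention_via_indices, token_idx) and shapes are hypothetical. The point it illustrates is the "discrete load": only the indexed, non-contiguous tokens are gathered from the KV cache before the attention computation touches them.

```python
# Hypothetical sketch of index-guided discrete loading for sparse attention.
# Instead of attending over a contiguous span, a query block attends only to
# the tokens named by an index list (loosely analogous to a virtual-to-physical
# address mapping). Names and shapes are illustrative, not the paper's API.
import torch

def sparse_attention_via_indices(q, k, v, token_idx):
    """q: (Bq, d) query block; k, v: (N, d) full KV caches;
    token_idx: (S,) int64 indices of the selected, non-contiguous tokens."""
    # "Discrete load": gather only the selected tokens into a compact buffer,
    # so the skipped regions of the KV cache are never read.
    k_sel = k.index_select(0, token_idx)           # (S, d)
    v_sel = v.index_select(0, token_idx)           # (S, d)
    scores = q @ k_sel.T / (q.shape[-1] ** 0.5)    # (Bq, S)
    probs = torch.softmax(scores, dim=-1)
    return probs @ v_sel                           # (Bq, d)

# Toy usage: 4 queries attend to 6 scattered tokens out of a 1024-token cache.
N, d = 1024, 64
q = torch.randn(4, d)
k, v = torch.randn(N, d), torch.randn(N, d)
token_idx = torch.tensor([3, 17, 402, 511, 770, 1001])
out = sparse_attention_via_indices(q, k, v, token_idx)
print(out.shape)  # torch.Size([4, 64])
```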
Key facts
- S2O stands for early stopping for sparse attention via online permutation.
- Attention scales quadratically with sequence length, limiting long-context inference.
- Existing block-granularity sparsification has an intrinsic sparsity ceiling.
- S2O is inspired by virtual-to-physical address mapping in memory systems.
- S2O factorizes FlashAttention execution.
- S2O loads non-contiguous tokens instead of a contiguous span.
- The method uses an online, index-guided, discrete loading policy (illustrated in the sketch after this list).
- The paper is on arXiv with ID 2602.22575.
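To connect the discrete loading policy with the factorized FlashAttention execution mentioned above, the sketch below shows a FlashAttention-style online softmax that streams over index-selected blocks instead of a contiguous range, so higher sparsity simply means fewer, scattered blocks. It is an assumption-laden illustration: the function name, the per-block index lists, and the single-query setting are not the paper's interface.

```python
# Minimal sketch (assumed interface, not the paper's kernel): FlashAttention-style
# online softmax accumulation over index-selected, non-contiguous KV blocks.
import torch

def online_softmax_attention(q, k, v, block_index_lists):
    """q: (d,); k, v: (N, d); block_index_lists: iterable of 1-D index tensors,
    each naming the (possibly non-contiguous) tokens of one selected block."""
    d = q.shape[-1]
    m = torch.tensor(float("-inf"))   # running max of the logits
    l = torch.tensor(0.0)             # running softmax normalizer
    acc = torch.zeros(d)              # running unnormalized output
    for idx in block_index_lists:
        k_blk, v_blk = k[idx], v[idx]                 # discrete load of one block
        s = (k_blk @ q) / d ** 0.5                    # block logits, shape (blk,)
        m_new = torch.maximum(m, s.max())
        scale = torch.exp(m - m_new)                  # rescale previous partials
        p = torch.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ v_blk
        m = m_new
    return acc / l

# Toy check against dense attention restricted to the same selected tokens.
N, d = 256, 32
q, k, v = torch.randn(d), torch.randn(N, d), torch.randn(N, d)
blocks = [torch.tensor([1, 7, 40]), torch.tensor([90, 91, 200, 231])]
ref_idx = torch.cat(blocks)
ref = torch.softmax((k[ref_idx] @ q) / d ** 0.5, dim=-1) @ v[ref_idx]
print(torch.allclose(online_softmax_attention(q, k, v, blocks), ref, atol=1e-5))
```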
Entities
Institutions
- arXiv