MISA: Efficient Sparse Attention for Long-Context LLMs
MISA (Mixture of Indexer Sparse Attention) is a method that improves the efficiency of sparse attention in long-context large language models. It replaces the indexer in DeepSeek Sparse Attention (DSA), whose many query heads (e.g., 64 in DeepSeek-V3.2) each score every prefix token, making the indexer itself costly at long context lengths. MISA instead treats the indexer heads as a mixture of experts: a lightweight router uses cheap block-level statistics to select a small, query-dependent subset of active heads, and only those heads perform the heavy token-level scoring. This preserves head diversity while reducing per-query cost. The work is published on arXiv as 2605.07363.
Key facts
- MISA is a drop-in replacement for the DSA indexer.
- DSA uses many query heads (e.g., 64 on DeepSeek-V3.2) that share the same selected token set.
- The multi-head design makes the indexer the dominant cost on long contexts.
- MISA treats indexer heads as a mixture-of-experts.
- A lightweight router uses cheap block-level statistics to pick a query-dependent subset of active heads.
- Only the selected heads run the heavy token-level scoring.
- MISA reduces per-query cost compared to scoring every prefix token with every head.
- The paper is on arXiv with ID 2605.07363.
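The routing idea described above can be sketched in a few lines. This is a hypothetical toy illustration, not the paper's implementation: block size, head counts, the mean-pooled block statistic, the max-over-blocks router score, and the sum aggregation over active heads are all assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

SEQ_LEN, BLOCK, D = 256, 32, 16   # toy sizes (assumed, not from the paper)
N_HEADS, ACTIVE = 8, 2            # indexer heads; router keeps top-2 per query
TOP_TOKENS = 16                   # tokens the indexer ultimately selects

# Stand-ins for the indexer's per-head query and key representations.
q = rng.standard_normal((N_HEADS, D))             # one query, per head
keys = rng.standard_normal((N_HEADS, SEQ_LEN, D)) # prefix keys, per head

# 1) Cheap block-level statistic: mean-pool keys within each block.
n_blocks = SEQ_LEN // BLOCK
block_keys = keys.reshape(N_HEADS, n_blocks, BLOCK, D).mean(axis=2)

# 2) Lightweight router: score each head by its strongest block response,
#    then keep only the top ACTIVE heads (mixture-of-experts routing).
router_scores = np.einsum('hd,hbd->hb', q, block_keys).max(axis=1)
active_heads = np.argsort(router_scores)[-ACTIVE:]

# 3) The heavy token-level scoring runs only on the selected heads.
token_scores = np.einsum('hd,htd->ht', q[active_heads], keys[active_heads])
combined = token_scores.sum(axis=0)               # aggregate over active heads
selected_tokens = np.argsort(combined)[-TOP_TOKENS:]

print(token_scores.shape)  # (2, 256): 2 active heads instead of all 8
```

In this sketch the token-level dot products are computed for 2 heads rather than all 8, which is the source of the per-query savings; the block-level router pass touches only `SEQ_LEN / BLOCK` pooled vectors per head.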