Adaptive Mass-Segmented KV Compression for Long-Context Reasoning

ai-technology · 2026-05-25

arXiv paper 2605.23200 introduces Adaptive Mass-Segmented (AMS) KV Compression, a method for curbing the linear growth of the Key-Value (KV) cache during long-form LLM inference. The authors identify a failure mode they call Region Wipe-out in existing compression methods that rely on global Top-k selection: contiguous reasoning blocks are heavily evicted, which disrupts logical coherence. AMS replaces token-level competition with region-aware quota allocation. It adaptively partitions the KV cache according to the spatial distribution of attention mass, so that vital reasoning segments receive a guaranteed share of memory. An EMA-based smoothing mechanism keeps segment boundaries from jittering across iterative decoding steps. AMS is designed as a universal plug-and-play layer that is orthogonal to existing scorers.
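The contrast between global Top-k and region-aware quotas can be illustrated with a minimal Python sketch. The paper's exact allocation rule is not given here, so the following choices are assumptions: segment edges are supplied as input, each segment gets a floor of `min_per_segment` slots, and the remaining budget is split in proportion to each segment's summed score. The function names `global_topk` and `quota_topk` are hypothetical. Because `scores` can come from any per-token scorer, the sketch also reflects the plug-and-play framing.

```python
import numpy as np


def global_topk(scores, budget):
    """Baseline: keep the `budget` highest-scoring token positions anywhere."""
    keep = np.argsort(-scores, kind="stable")[:budget]
    return np.sort(keep)


def quota_topk(scores, boundaries, budget, min_per_segment=1):
    """Hypothetical region-aware selection (not the paper's exact rule).

    `boundaries` are segment edges [0, b1, ..., n]. Each segment is first
    guaranteed `min_per_segment` slots; the rest of the budget is split in
    proportion to each segment's summed score ("mass"), with rounding
    leftovers going to the largest fractional shares. Tokens are then picked
    by Top-k *within* each segment instead of across the whole cache.
    """
    starts, ends = boundaries[:-1], boundaries[1:]
    lengths = ends - starts
    base = np.minimum(lengths, min_per_segment)
    remaining = budget - base.sum()
    assert remaining >= 0, "budget must cover the per-segment floor"

    mass = np.array([scores[s:e].sum() for s, e in zip(starts, ends)])
    assert mass.sum() > 0, "scores are assumed non-negative and not all zero"
    share = remaining * mass / mass.sum()
    extra = np.floor(share).astype(int)
    leftover = remaining - extra.sum()
    order = np.argsort(-(share - extra), kind="stable")
    extra[order[:leftover]] += 1

    # Never ask a segment for more tokens than it contains.
    quotas = np.minimum(base + extra, lengths)
    keep = []
    for s, e, q in zip(starts, ends, quotas):
        local = np.argsort(-scores[s:e], kind="stable")[:q]
        keep.extend(s + local)
    return np.sort(np.array(keep))


# Toy cache: a high-scoring block (positions 0-5) and a weaker but
# contiguous reasoning block (positions 6-11).
scores = np.array([9, 8, 9, 8, 9, 8, 1, 2, 1, 2, 1, 2], dtype=float)
print(global_topk(scores, 6))
print(quota_topk(scores, np.array([0, 6, 12]), budget=6))
```

With a budget of 6, global Top-k spends every slot on positions 0-5, so the second block is wiped out. The quota version keeps four tokens from the first block and two (positions 7 and 9) from the second. In this sketch, budget left unused when a quota is clamped to a short segment is simply dropped rather than redistributed.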

Key facts

arXiv paper 2605.23200 proposes Adaptive Mass-Segmented (AMS) KV Compression
Addresses linear growth of KV cache in long-form LLM inference
Existing Top-k selection causes Region Wipe-out of contiguous reasoning blocks
AMS shifts from token-level competition to region-aware quota allocation
Partitions KV cache based on spatial distribution of attention mass
EMA-based smoothing mechanism prevents jitter in segment boundaries
AMS is a universal plug-and-play layer orthogonal to existing scorers
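The mass-based partitioning and EMA smoothing items can be sketched in Python as follows, under stated assumptions. Here `mass_cuts` (hypothetical) places cut points so that each segment holds roughly equal cumulative attention mass, and `BoundarySmoother` (also hypothetical) applies an exponential moving average to those cut points across decoding steps. The equal-mass rule, smoothing the cut positions themselves (rather than the underlying mass), and `alpha=0.2` are illustrative choices, not details confirmed by the paper.

```python
import numpy as np


def mass_cuts(mass, num_segments):
    """Hypothetical adaptive partition: interior cut points chosen so each
    segment holds roughly equal cumulative attention mass. High-mass regions
    therefore get narrow segments, low-mass stretches get wide ones."""
    cdf = np.cumsum(mass) / np.sum(mass)
    targets = np.arange(1, num_segments) / num_segments
    # First position where cumulative mass reaches each target; cut after it.
    return np.searchsorted(cdf, targets, side="left") + 1


class BoundarySmoother:
    """Hypothetical EMA smoother for cut points across decoding steps."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha  # weight given to the newest raw cut points
        self.state = None   # smoothed cut points, kept as floats

    def update(self, raw_cuts):
        raw = np.asarray(raw_cuts, dtype=float)
        if self.state is None:
            self.state = raw
        else:
            # Exponential moving average: new = a * raw + (1 - a) * old.
            self.state = self.alpha * raw + (1 - self.alpha) * self.state
        return np.round(self.state).astype(int)


print(mass_cuts(np.array([5, 5, 1, 1, 1, 1, 1, 1], dtype=float), 2))  # [2]

# Raw cuts that flip by 2 tokens every step (jitter) vs. their smoothed version.
smoother = BoundarySmoother(alpha=0.2)
for step in range(6):
    raw = [4, 8] if step % 2 == 0 else [6, 10]
    print(step, raw, smoother.update(raw))
```

In this toy run the raw cut points jump by two tokens at every step. Because each EMA update moves the smoothed state by at most `alpha` times that gap (0.4 tokens here), the rounded boundaries shift by no more than one token per step. With sharply peaked mass, two cuts in this sketch can land on the same position and leave an empty segment, which a fuller implementation would need to handle.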
