Adaptive Mass-Segmented KV Compression for Long-Context Reasoning
arXiv paper 2605.23200 introduces Adaptive Mass-Segmented (AMS) KV Compression, a method that addresses the linear growth of the Key-Value (KV) cache during long-form LLM inference. The authors observe that existing compression methods, which rely on global Top-k selection, cause Region Wipe-out: contiguous reasoning blocks are severely evicted, which disrupts logical coherence. AMS replaces token-level competition with region-aware quota allocation. It adaptively partitions the KV cache according to the spatial distribution of attention mass, so that vital reasoning segments receive a guaranteed share of memory. An EMA-based (exponential moving average) smoothing mechanism keeps segment boundaries from jittering during iterative decoding. The authors present AMS as a universal plug-and-play layer that is orthogonal to existing scorers.
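The contrast between global Top-k selection and region-aware quotas can be illustrated with a small Python sketch. This is not the paper's implementation: the function names (`global_top_k`, `mass_segments`, `quota_select`), the equal-mass cutting rule, and the equal per-segment quota are all assumptions chosen for illustration, since this summary does not give the exact partitioning or allocation rules. The toy scores stand in for whatever per-token importance an existing scorer produces.

```python
from itertools import accumulate


def global_top_k(scores, budget):
    """Baseline: keep the `budget` highest-scoring positions, wherever they fall."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def mass_segments(scores, num_segments):
    """Cut positions into contiguous, non-empty segments of roughly equal mass.

    Assumes positive scores and len(scores) >= num_segments (illustrative rule).
    """
    n = len(scores)
    total = sum(scores)
    cumulative = list(accumulate(scores))
    bounds = [0]
    for s in range(1, num_segments):
        # First position where cumulative mass reaches s/num_segments of the total.
        cut = next(i for i, c in enumerate(cumulative)
                   if c * num_segments >= total * s) + 1
        # Clamp so this segment and all later ones keep at least one position.
        cut = min(max(cut, bounds[-1] + 1), n - (num_segments - s))
        bounds.append(cut)
    bounds.append(n)
    return list(zip(bounds[:-1], bounds[1:]))


def quota_select(scores, segments, budget):
    """Split the budget evenly across segments; tokens compete only within one."""
    base, extra = divmod(budget, len(segments))
    kept = []
    for n, (start, end) in enumerate(segments):
        quota = base + (1 if n < extra else 0)
        local = sorted(range(start, end), key=lambda i: scores[i], reverse=True)
        kept.extend(local[:quota])
    return sorted(kept)


# Toy scores: positions 4..7 form a lower-scoring but contiguous middle block.
scores = [9, 8, 9, 8, 5, 4, 5, 4, 9, 8, 9, 8]
middle_block = set(range(4, 8))

top_k_kept = global_top_k(scores, budget=6)        # [0, 1, 2, 3, 8, 10]
segments = mass_segments(scores, num_segments=3)   # [(0, 4), (4, 9), (9, 12)]
quota_kept = quota_select(scores, segments, 6)     # [0, 2, 4, 8, 9, 10]

print("global top-k keeps:", top_k_kept)
print("segment quotas keep:", quota_kept)
```

In this toy case global Top-k spends the whole budget on the two high-scoring ends and evicts every position in the middle block, while the per-segment quota leaves position 4 in the cache. Because selection inside each segment only needs a ranking, the per-token scores could come from any scorer, which is one way to read the "orthogonal to existing scorers" claim.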
Key facts
- arXiv paper 2605.23200 proposes Adaptive Mass-Segmented (AMS) KV Compression
- Addresses linear growth of KV cache in long-form LLM inference
- Existing Top-k selection causes Region Wipe-out of contiguous reasoning blocks
- AMS shifts from token-level competition to region-aware quota allocation
- Partitions KV cache based on spatial distribution of attention mass
- EMA-based smoothing mechanism prevents jitter in segment boundaries
- AMS is a universal plug-and-play layer orthogonal to existing scorers
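The effect of EMA smoothing on segment boundaries can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: the function name `ema_boundaries`, the smoothing weight `alpha=0.25`, and the made-up boundary observations are all hypothetical, and the summary does not say which quantity the authors actually smooth or what weight they use.

```python
def ema_boundaries(history, alpha=0.25):
    """Return the EMA-smoothed boundary positions after each decoding step.

    history: per-step lists of observed boundary positions (same length each).
    alpha:   weight of the newest observation; smaller values smooth more.
    """
    smoothed = [float(b) for b in history[0]]
    trace = [list(smoothed)]
    for observed in history[1:]:
        # Standard EMA update: new = (1 - alpha) * old + alpha * observation.
        smoothed = [(1 - alpha) * s + alpha * o
                    for s, o in zip(smoothed, observed)]
        trace.append(list(smoothed))
    return trace


def spread(values):
    """Range of a sequence, used here as a crude jitter measure."""
    return max(values) - min(values)


# Hypothetical raw boundaries recomputed at four decoding steps; they jitter
# by several positions from step to step.
raw = [[40, 90], [46, 84], [39, 91], [45, 85]]
trace = ema_boundaries(raw)

raw_spread = spread([step[0] for step in raw])    # 7
ema_spread = spread([step[0] for step in trace])  # 1.90625

print("raw first-boundary spread:", raw_spread)
print("smoothed first-boundary spread:", ema_spread)
```

A smaller `alpha` damps jitter more strongly but makes the boundaries slower to follow a genuine shift in where attention mass sits, so the weight trades stability against responsiveness.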
Entities
Institutions
- arXiv