KV Cache Compression for Vision-Language Models
A new research paper on arXiv (2605.16439) introduces KVCapsule, a method for efficient sequential KV cache compression in Vision-Language Models (VLMs). VLMs extend Large Language Models (LLMs) to multimodal reasoning over text and image inputs, but they incur high memory overhead because the key-value (KV) cache grows large during autoregressive decoding. Images produce longer token sequences and denser feature representations than text, and vision tokens exhibit structured attention patterns that render many LLM-oriented compression techniques ineffective. The authors conduct an empirical analysis of vision token behavior and propose KVCapsule to address these challenges.
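To put the memory overhead in perspective, the size of a KV cache can be estimated with simple arithmetic: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below is illustrative only and is not taken from the paper. The function name `kv_cache_bytes`, the decoder configuration (32 layers, 32 KV heads, head dimension 128, 16-bit values), and the token counts (64 for a text prompt, 576 for one image) are all assumptions chosen for the example.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Estimate the bytes needed to cache keys and values for num_tokens tokens.

    Assumes batch size 1 and no compression.
    """
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


# Hypothetical decoder configuration (not from the paper).
config = dict(num_layers=32, num_kv_heads=32, head_dim=128, bytes_per_value=2)

# Hypothetical sequence lengths: a short text prompt vs. one encoded image.
text_tokens = 64
image_tokens = 576

text_bytes = kv_cache_bytes(text_tokens, **config)
image_bytes = kv_cache_bytes(image_tokens, **config)

print(f"text prompt: {text_bytes / 2**20:.0f} MiB")
print(f"one image:   {image_bytes / 2**20:.0f} MiB")
```

Under these assumed numbers each cached token costs 0.5 MiB, so the text prompt occupies 32 MiB and a single image occupies 288 MiB. The cache grows linearly with sequence length, which is why long vision token sequences dominate memory use.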
Key facts
- Paper on arXiv: 2605.16439
- Title: KVCapsule: Efficient Sequential KV Cache Compression for Vision-Language Models with Asymmetric Redundancy
- Focuses on KV cache compression for VLMs
- VLMs extend LLMs to multimodal reasoning
- Images produce longer token sequences and denser feature representations than text
- Vision tokens have structured attention patterns
- Many LLM-oriented compression techniques are ineffective for vision tokens
- Proposes KVCapsule based on empirical analysis
Entities
Institutions
- arXiv