RetentiveKV: Entropy-Driven KV Cache Eviction for Multimodal LLMs
RetentiveKV is a new method that addresses the memory and efficiency bottlenecks of multimodal large language models by reformulating KV cache eviction as continuous memory evolution rather than discrete pruning. It uses information entropy to quantify token importance, sidestepping the limitations of the 'persistence of importance' hypothesis, which breaks down for visual tokens that exhibit deferred importance and spatial continuity. RetentiveKV further leverages state space models to maintain a dynamic memory of visual context, preventing premature eviction of tokens that only become critical later in decoding. The paper is available on arXiv under ID 2605.04075.
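The entropy-based importance idea can be illustrated with a minimal sketch. The code below is an assumption-laden toy, not the paper's actual algorithm: it scores each cached token by the Shannon entropy of the attention mass it has received across decoding steps, then keeps only the top-scoring tokens within a cache budget. The function names (`entropy_scores`, `evict_to_budget`) and the use of raw attention weights as the importance signal are hypothetical; the paper's SSM-based continuous memory evolution is not modeled here.

```python
import numpy as np

def entropy_scores(attn: np.ndarray, eps: float = 1e-9) -> np.ndarray:
    """Shannon entropy of each cached token's attention profile.

    attn: (num_decode_steps, num_cached_tokens) attention weights.
    A token attended to uniformly across many steps gets high entropy;
    a token spiked at a single step gets low entropy. Using this as an
    'importance' proxy is an illustrative assumption, not the paper's rule.
    """
    # Normalize each token's column into a distribution over decode steps.
    p = attn / (attn.sum(axis=0, keepdims=True) + eps)
    return -(p * np.log(p + eps)).sum(axis=0)

def evict_to_budget(attn: np.ndarray, budget: int) -> np.ndarray:
    """Return the (sorted) indices of the `budget` tokens to retain."""
    scores = entropy_scores(attn)
    keep = np.argsort(scores)[-budget:]   # highest-entropy tokens survive
    return np.sort(keep)                  # preserve positional order
```

A token that a discrete top-k pruner would drop early (low attention so far) can still survive here if its attention is spread across steps, which loosely mirrors the 'deferred importance' concern for visual tokens.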
Key facts
- RetentiveKV is a KV cache optimization method for multimodal LLMs.
- Uses information entropy to score token importance and state space models to maintain a dynamic memory of visual context.
- Addresses the 'deferred importance' of visual tokens, which discrete top-k pruning mishandles.
- Replaces discrete pruning with continuous memory evolution.
- Published on arXiv with ID 2605.04075.
- arXiv announce type: cross.
Entities
Institutions
- arXiv