LightKV: Reducing Vision Token KV Cache in LVLMs
Researchers propose LightKV, a method that reduces the Key-Value (KV) cache size in Large Vision-Language Models (LVLMs) by exploiting redundancy among vision-token embeddings. Guided by the text prompt, LightKV applies cross-modality message passing to aggregate and compress vision tokens during the prefill stage, distinguishing it from prior compression strategies that consider vision tokens in isolation. Evaluated on eight open-source LVLMs across eight benchmarks, including MME and SeedBench, LightKV maintains performance while retaining only 55% of the original vision tokens, substantially reducing GPU memory overhead.
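The summary does not give LightKV's exact algorithm, but the idea of text-guided aggregation can be illustrated with a minimal sketch: score each vision token by text-to-vision cross-attention, keep the top fraction, and merge the pruned tokens into their most similar kept tokens via similarity-weighted message passing. All function names and the scoring/merging rules below are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compress_vision_tokens(vision_tokens, text_tokens, keep_ratio=0.55):
    """Hypothetical sketch: prompt-guided vision-token compression.

    vision_tokens: (Nv, d) vision-token embeddings
    text_tokens:   (Nt, d) text-prompt embeddings
    Returns roughly keep_ratio * Nv compressed vision tokens.
    """
    nv, d = vision_tokens.shape
    # Cross-modality scoring: how strongly the text prompt attends to each vision token.
    attn = softmax(text_tokens @ vision_tokens.T / np.sqrt(d), axis=-1)  # (Nt, Nv)
    scores = attn.mean(axis=0)                                           # (Nv,)
    k = max(1, int(round(keep_ratio * nv)))
    keep = np.zeros(nv, dtype=bool)
    keep[np.argsort(scores)[-k:]] = True
    kept, pruned = vision_tokens[keep], vision_tokens[~keep]
    if len(pruned):
        # Message passing: fold each pruned token into kept tokens,
        # weighted by embedding similarity, as a weighted average.
        w = softmax(pruned @ kept.T / np.sqrt(d), axis=-1)               # (Np, k)
        kept = (kept + w.T @ pruned) / (1.0 + w.sum(axis=0, keepdims=True).T)
    return kept
```

Because the compressed tokens replace the full vision sequence before the prefill stage, every subsequent layer's KV cache stores only ~55% as many vision entries, which is where the memory saving comes from.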
Key facts
- LightKV reduces KV cache size in LVLMs.
- It uses cross-modality message passing guided by text prompts.
- Evaluated on eight open-source LVLMs and eight benchmarks.
- Maintains performance while retaining only 55% of the original vision tokens.
- Addresses GPU memory overhead from vision tokens.
- Distinguished from prior compression methods that operate on vision tokens alone.
- Tested on MME and SeedBench datasets.
- Published on arXiv with ID 2605.00789.
Entities
Institutions
- arXiv