CIVIC: Efficient Vision-Language Model via Compact Visual Inference
A new paper introduces CIVIC, a path-consistent compact visual inference framework for Vision-Language Models (VLMs). It targets the memory and latency bottlenecks caused by high-resolution visual tokens by keeping the sequence representation compact across the vision encoder, projection layer, LLM prefill, and KV-cache. Keeping the sequence compact along the whole path avoids non-contiguous memory access and localized unmerging overheads, so the reduction in sequence length translates into genuine hardware efficiency. Evaluated on Qwen3-VL, CIVIC reduces KV-cache memory to about one-third of the baseline and cuts end-to-end inference time. The paper is available on arXiv.
Key facts
- CIVIC is a path-consistent compact visual inference framework for VLMs.
- It maintains compact sequences across vision encoder, projection layer, LLM prefill, and KV-cache.
- It avoids non-contiguous memory access and localized unmerging overheads.
- It is evaluated on the Qwen3-VL architecture.
- KV-cache memory is reduced to approximately one-third of the baseline.
- End-to-end inference time is reduced.
- The paper is published on arXiv with ID 2605.28115.
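To see why a shorter sequence shrinks the KV-cache proportionally, the arithmetic can be sketched as follows. This is a generic back-of-the-envelope estimate for a standard transformer KV-cache, not the paper's method or measurement. The function name `kv_cache_bytes`, the layer, head, and dimension values, and the token counts are all hypothetical and are not Qwen3-VL's actual configuration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Estimate KV-cache size in bytes for one sequence.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes a 16-bit format such as fp16 or bf16.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape and token counts, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
BASELINE_TOKENS = 3072   # assumed full-length sequence
COMPACT_TOKENS = 1024    # assumed compact sequence, one-third as long

baseline = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, BASELINE_TOKENS)
compact = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, COMPACT_TOKENS)

print(f"baseline: {baseline / 2**20:.0f} MiB")   # 384 MiB
print(f"compact:  {compact / 2**20:.0f} MiB")    # 128 MiB
print(f"ratio:    {compact / baseline:.3f}")     # 0.333
```

Because cache size is linear in sequence length, the saving appears in memory only if the cached sequence really is shorter. A sequence that is compressed in one stage and expanded again before caching would not shrink the cache.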
Entities
Institutions
- arXiv