CIVIC: Efficient Vision-Language Model via Compact Visual Inference

ai-technology · 2026-05-28

A new paper introduces CIVIC, a path-consistent compact visual inference framework for Vision-Language Models (VLMs). It targets the memory and latency bottlenecks caused by high-resolution visual tokens by keeping the visual sequence compact along the whole inference path: the vision encoder, the projection layer, LLM prefill, and the KV-cache. This avoids non-contiguous memory access and localized unmerging overheads, so the reduction in sequence length translates into genuine hardware efficiency. Evaluated on Qwen3-VL, CIVIC reduces KV-cache memory to about one-third of the baseline and cuts end-to-end inference time. The paper is available on arXiv (ID 2605.28115).

Key facts

CIVIC is a path-consistent compact visual inference framework for VLMs.
It maintains compact sequences across vision encoder, projection layer, LLM prefill, and KV-cache.
It avoids non-contiguous memory access and localized unmerging overheads.
It was evaluated on the Qwen3-VL architecture.
KV-cache memory is reduced to approximately one-third of the baseline.
End-to-end inference time is reduced.
The paper is published on arXiv with ID 2605.28115.
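
To make the KV-cache figure concrete, the sketch below applies the standard size formula for a transformer KV-cache, in which keys and values are each stored per layer, per KV head, and per token. Every number in it is an illustrative assumption rather than a value from the paper or from Qwen3-VL: the layer count, KV-head count, head dimension, token counts, and the one-third keep ratio for visual tokens are hypothetical, as is the helper name kv_cache_bytes.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Size in bytes of a standard transformer KV-cache for one sequence.

    Keys and values are each stored per layer, per KV head, per token,
    hence the leading factor of 2. bytes_per_elem=2 corresponds to fp16/bf16.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape (NOT the Qwen3-VL configuration).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

# Hypothetical prompt: many visual tokens from a high-resolution image,
# plus a short text prompt.
TEXT_TOKENS = 128
VISUAL_TOKENS_BASELINE = 3072
VISUAL_TOKENS_COMPACT = VISUAL_TOKENS_BASELINE // 3  # assumed keep ratio

baseline = kv_cache_bytes(TEXT_TOKENS + VISUAL_TOKENS_BASELINE, LAYERS, KV_HEADS, HEAD_DIM)
compact = kv_cache_bytes(TEXT_TOKENS + VISUAL_TOKENS_COMPACT, LAYERS, KV_HEADS, HEAD_DIM)
ratio = compact / baseline

print(f"baseline: {baseline / 2**20:.0f} MiB")
print(f"compact:  {compact / 2**20:.0f} MiB")
print(f"ratio:    {ratio:.2f}")
```

Under these assumptions the cache shrinks from 400 MiB to 144 MiB, a ratio of 0.36. Because cache size is linear in sequence length, the ratio moves closer to the visual keep ratio as visual tokens make up a larger share of the prompt.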
