ARTFEED — Contemporary Art Intelligence

CIVIC: Efficient Vision-Language Model via Compact Visual Inference

ai-technology · 2026-05-28

A new paper introduces CIVIC, a path-consistent compact visual inference framework for Vision-Language Models (VLMs). It targets the memory and latency bottlenecks caused by high-resolution visual tokens by keeping the visual sequence compact at every stage of the pipeline: the vision encoder, the projection layer, LLM prefill, and the KV-cache. Keeping the sequence compact end to end avoids non-contiguous memory access and localized unmerging overheads, so the shorter sequence translates into genuine hardware efficiency. Evaluated on Qwen3-VL, CIVIC reduces KV-cache memory to about one-third of the baseline and cuts end-to-end inference time. The paper is available on arXiv (ID 2605.28115).
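To see why a shorter visual sequence shrinks the KV-cache, it helps to recall that cache size grows linearly with the number of cached tokens. The sketch below is back-of-the-envelope arithmetic only: the layer count, head count, head dimension, and token counts are hypothetical placeholders, not Qwen3-VL's actual configuration and not figures from the paper.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for seq_len tokens."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model configuration (an assumption, not Qwen3-VL's real settings).
CONFIG = dict(num_layers=32, num_kv_heads=8, head_dim=128)

full_tokens = 3072                  # assumed number of visual tokens at baseline
compact_tokens = full_tokens // 3   # assumed compaction to one-third of the tokens

baseline = kv_cache_bytes(full_tokens, **CONFIG)
compact = kv_cache_bytes(compact_tokens, **CONFIG)

print(f"baseline: {baseline / 2**20:.1f} MiB")
print(f"compact:  {compact / 2**20:.1f} MiB")
print(f"ratio:    {compact / baseline:.3f}")
```

With these assumed numbers the cache drops from 384 MiB to 128 MiB in 16-bit precision. The ratio depends only on the token counts, which is why a sequence that stays compact all the way into the cache yields a proportional memory saving.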

Key facts

  • CIVIC is a path-consistent compact visual inference framework for VLMs.
  • It maintains compact sequences across vision encoder, projection layer, LLM prefill, and KV-cache.
  • It avoids non-contiguous memory access and localized unmerging overheads.
  • It was evaluated on the Qwen3-VL architecture.
  • KV-cache memory reduced to approximately one-third of baseline.
  • End-to-end inference time is reduced.
  • Paper published on arXiv with ID 2605.28115.
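The difference between a path-consistent pipeline and one that merges tokens locally and then unmerges them can be illustrated with a toy model that only tracks sequence lengths. This is not CIVIC's algorithm: the averaging merge, the repeat-based unmerge, and all function and stage names below are hypothetical stand-ins, and plain numbers stand in for token embeddings.

```python
def merge_adjacent(tokens, group):
    """Average each run of `group` adjacent token values into one token."""
    merged = []
    for i in range(0, len(tokens), group):
        chunk = tokens[i:i + group]
        merged.append(sum(chunk) / len(chunk))
    return merged


def unmerge(tokens, group):
    """Restore the original length by repeating each merged token."""
    return [t for t in tokens for _ in range(group)]


# Hypothetical stage names mirroring the pipeline described above.
STAGES = ("vision_encoder", "projection", "llm_prefill", "kv_cache")


def stage_lengths(tokens, group, path_consistent):
    """Return the sequence length seen at each stage of the toy pipeline."""
    lengths = {}
    seq = merge_adjacent(tokens, group)   # compaction inside the encoder
    lengths[STAGES[0]] = len(seq)
    if not path_consistent:
        seq = unmerge(seq, group)         # local unmerge before handing off
    for stage in STAGES[1:]:
        lengths[stage] = len(seq)         # later stages inherit this length
    return lengths


tokens = [float(i) for i in range(12)]
print(stage_lengths(tokens, group=3, path_consistent=True))
print(stage_lengths(tokens, group=3, path_consistent=False))
```

In the path-consistent case every stage sees 4 tokens. In the merge-then-unmerge case only the encoder does, and the projection, prefill, and cache stages are back at 12, so the earlier reduction saves nothing downstream.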

Entities

Institutions

  • arXiv
