ARTFEED — Contemporary Art Intelligence

CoVSpec: Efficient Device-Edge Co-Inference for Vision-Language Models

ai-technology · 2026-05-06

Researchers have introduced CoVSpec, a framework that makes device-edge co-inference of vision-language models (VLMs) more efficient through speculative decoding. Applying speculative decoding to VLMs is hard for two reasons: the heavy computation over visual tokens and the communication overhead between device and edge. CoVSpec addresses both with a training-free visual-token reduction technique that scores tokens on the mobile device by query relevance, token activity, and low-rank dependency, and prunes the redundant ones. A lightweight draft VLM on the device then works in tandem with a more powerful target VLM on an edge server, lowering both compute and memory requirements. Details of this research can be found in arXiv paper 2605.02218.
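The paper's exact scoring formulas are not given in this summary, so the following is only a minimal sketch of what a training-free, three-signal pruning step could look like in PyTorch. The function name `prune_visual_tokens`, the choice of cosine similarity for query relevance, embedding norm for token activity, and an SVD reconstruction residual for low-rank dependency, along with the equal weighting, are all illustrative assumptions rather than CoVSpec's actual method.

```python
import torch
import torch.nn.functional as F

def prune_visual_tokens(vis_tokens, query_emb, keep_ratio=0.5, rank=16):
    """Score visual tokens by three signals and keep the top fraction.

    vis_tokens: (N, d) visual token embeddings from the vision encoder
    query_emb:  (d,)   pooled embedding of the user's text query
    NOTE: every scoring term below is an illustrative stand-in,
    not the formula from the CoVSpec paper.
    """
    # 1) Query relevance: cosine similarity between each visual token
    #    and the text query embedding.
    relevance = F.cosine_similarity(vis_tokens, query_emb.unsqueeze(0), dim=-1)

    # 2) Token activity: embedding norm as a cheap proxy for how
    #    strongly a token activates.
    activity = vis_tokens.norm(dim=-1)

    # 3) Low-rank dependency: tokens that are well reconstructed from a
    #    rank-r approximation of the token matrix carry redundant
    #    information; a large residual marks an independent token.
    U, S, Vh = torch.linalg.svd(vis_tokens, full_matrices=False)
    low_rank = (U[:, :rank] * S[:rank]) @ Vh[:rank]
    residual = (vis_tokens - low_rank).norm(dim=-1)

    # Normalize each term to [0, 1] and combine; equal weights here.
    def norm01(x):
        return (x - x.min()) / (x.max() - x.min() + 1e-8)
    score = norm01(relevance) + norm01(activity) + norm01(residual)

    k = max(1, int(keep_ratio * vis_tokens.shape[0]))
    keep = score.topk(k).indices.sort().values  # preserve spatial order
    return vis_tokens[keep], keep
```

Because every signal is computed from tensors the device already holds, a step like this needs no extra training or fine-tuning, which is the property the paper emphasizes.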

Key facts

  • CoVSpec is a framework for device-edge co-inference of VLMs.
  • It uses speculative decoding: a lightweight draft VLM on the mobile device proposes tokens that a larger target VLM on the edge server verifies (sketched after this list).
  • A training-free visual token reduction method prunes redundant tokens.
  • Token reduction considers query relevance, token activity, and low-rank dependency.
  • The approach addresses excessive visual-token computation and communication overhead.
  • The paper is available on arXiv with ID 2605.02218.
  • The work aims to make large-VLM inference practical on mobile devices via device-edge collaboration.
  • The method does not require additional training.
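For context on the draft-and-verify split the key facts describe, here is a minimal sketch of one speculative-decoding round, assuming Hugging Face-style causal-LM interfaces that return `.logits`. The function `co_inference_step`, the draft length `gamma`, and the greedy acceptance rule are illustrative assumptions; CoVSpec's actual protocol (including how drafted tokens are transmitted between device and edge) is not specified in this summary.

```python
import torch

@torch.no_grad()
def co_inference_step(draft_model, target_model, ids, gamma=4):
    """One speculative-decoding round: the on-device draft model
    proposes gamma tokens; the edge-side target model verifies them
    in a single forward pass. Greedy acceptance is used for brevity;
    a probabilistic acceptance rule is common in practice.
    """
    # Device side: autoregressively draft gamma candidate tokens
    # (no KV cache here, for clarity).
    draft_ids = ids
    for _ in range(gamma):
        logits = draft_model(draft_ids).logits[:, -1]
        draft_ids = torch.cat([draft_ids, logits.argmax(-1, keepdim=True)], dim=-1)
    proposed = draft_ids[:, ids.shape[1]:]  # the gamma drafted tokens

    # Edge side: one target-model pass scores all drafted positions at
    # once; this is what amortizes the cost of the large model.
    tgt_logits = target_model(draft_ids).logits
    tgt_pred = tgt_logits[:, ids.shape[1] - 1:-1].argmax(-1)

    # Accept the longest prefix where draft and target agree, then take
    # one corrected token from the target so every round makes progress.
    match = (tgt_pred == proposed)[0]
    n_ok = int(match.long().cumprod(0).sum())
    accepted = proposed[:, :n_ok]
    if n_ok < proposed.shape[1]:
        correction = tgt_pred[:, n_ok:n_ok + 1]
    else:
        correction = tgt_logits[:, -1].argmax(-1, keepdim=True)
    return torch.cat([ids, accepted, correction], dim=-1)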

Entities

Institutions

  • arXiv

Sources