SPpruner: Subject-Centric Visual Token Reduction for VLMs

other · 2026-05-22

SPpruner is a new method that reduces the computational cost of Vision-Language Models (VLMs) by progressively pruning visual tokens. It mimics the focus-then-context mechanism of human visual perception: a focus identification module models visual saliency and semantic relevance to preserve a high-fidelity representation of the subject, and a context-aware structural scanning module then aggregates contextual cues. The approach aims to keep salient subjects and their relationships while reducing the token count, addressing the inference bottleneck caused by massive visual token sequences.
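To make the focus-then-context idea concrete, here is a minimal single-stage sketch in Python with NumPy. It is an illustration under stated assumptions, not the paper's implementation: the function name, the `alpha` weighting between saliency and relevance, the cosine-similarity relevance score, and the chunk-averaging used to stand in for "structural scanning" are all hypothetical choices, and the progressive (multi-stage) aspect is omitted.

```python
import numpy as np


def _minmax(x, eps=1e-8):
    """Rescale a 1-D score vector to roughly [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min() + eps)


def focus_then_context_prune(tokens, query, saliency,
                             num_focus, num_context, alpha=0.5, eps=1e-8):
    """Hypothetical focus-then-context token reduction (illustrative only).

    tokens:   (N, D) visual token embeddings, in raster (scan) order
    query:    (D,)   pooled text embedding used for semantic relevance
    saliency: (N,)   per-token saliency score (assumed given, e.g. from attention)
    Returns (kept_tokens, focus_idx), where kept_tokens has at most
    num_focus + num_context rows.
    """
    n = tokens.shape[0]

    # Focus step: combine saliency with cosine similarity to the text query.
    v = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + eps)
    q = query / (np.linalg.norm(query) + eps)
    relevance = v @ q
    score = alpha * _minmax(saliency) + (1.0 - alpha) * _minmax(relevance)

    # Keep the top-scoring tokens unchanged, preserving their original order.
    focus_idx = np.sort(np.argsort(-score)[:num_focus])
    kept = tokens[focus_idx]

    # Context step (assumption): scan the remaining tokens in order and
    # average contiguous chunks into a few context tokens.
    rest_idx = np.setdiff1d(np.arange(n), focus_idx)
    if num_context > 0 and rest_idx.size > 0:
        chunks = [c for c in np.array_split(rest_idx, num_context) if c.size > 0]
        context = np.stack([tokens[c].mean(axis=0) for c in chunks])
        kept = np.concatenate([kept, context], axis=0)

    return kept, focus_idx
```

Under these assumptions, calling the function on 576 visual tokens with `num_focus=64` and `num_context=16` would hand 80 tokens to the language model instead of 576. Averaging the discarded tokens rather than dropping them is one simple way to retain coarse context; the paper's scanning module may work quite differently.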

Key facts

SPpruner is a subject-centric progressive reduction paradigm for VLMs.
It emulates the Focus-then-Context mechanism of human visual perception.
A focus identification module models the interplay between visual saliency and semantic relevance.
A context-aware structural scanning module aggregates contextual cues.
The method aims to reduce computational costs from massive visual token sequences.
It preserves a high-fidelity representation of the visual input.
The approach retains salient subjects together with their contextual relationships.
The paper is available on arXiv with ID 2605.20950.

Sources

arXiv cs.AI — 2026-05-21