LiteLVLM: Training-Free Token Pruning for Efficient Pixel Grounding in Vision-Language Models
LiteLVLM is a training-free, text-guided token pruning strategy for efficient pixel grounding inference in large vision-language models. To reduce the computational overhead of visual tokens, it reverses the ranking of CLIP's visual-text similarity, retaining the tokens that cover the referent region, and then recovers context tokens so that foreground and background remain clearly separated. Extensive experiments validate the method's effectiveness.
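The summary does not spell out the exact scoring or recovery rules, but a minimal sketch of this style of text-guided pruning, under the stated idea of a reversed CLIP similarity ranking plus a small set of recovered context tokens, might look as follows. The function name `prune_visual_tokens` and the parameters `keep_ratio` and `context_ratio` are illustrative assumptions, not the paper's API.

```python
import torch
import torch.nn.functional as F

def prune_visual_tokens(visual_feats, text_feat, keep_ratio=0.25, context_ratio=0.05):
    """Sketch of text-guided token pruning (hypothetical, not the paper's code).

    visual_feats:  (N, D) CLIP visual token embeddings
    text_feat:     (D,)   CLIP text embedding of the referring expression
    keep_ratio:    fraction of tokens kept from the reversed similarity ranking
    context_ratio: fraction of high-similarity tokens recovered as context
    """
    # Cosine similarity between every visual token and the text query.
    v = F.normalize(visual_feats, dim=-1)
    t = F.normalize(text_feat, dim=-1)
    sim = v @ t  # (N,)

    n = visual_feats.size(0)
    n_keep = max(1, int(n * keep_ratio))
    n_ctx = max(1, int(n * context_ratio))

    # Reversed ranking: sort by ascending similarity, so the least text-similar
    # tokens (assumed here to cover the referent region) come first.
    order = torch.argsort(sim, descending=False)
    referent_idx = order[:n_keep]

    # Recover a few of the most text-similar tokens as context, giving the
    # model a foreground-background contrast.
    context_idx = order[-n_ctx:]

    keep_idx = torch.unique(torch.cat([referent_idx, context_idx]))
    return visual_feats[keep_idx], keep_idx

# Example usage with dummy features (e.g. 24x24 patch tokens of dimension 768).
feats = torch.randn(576, 768)
text = torch.randn(768)
pruned, kept = prune_visual_tokens(feats, text)
print(pruned.shape, kept.shape)
```

The pruning ratios here are arbitrary placeholders; the actual selection criteria and token budgets would follow the paper.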
Key facts
- LiteLVLM is a training-free token pruning strategy.
- It targets pixel grounding tasks in large vision-language models.
- The method reverses CLIP's visual-text similarity ranking.
- It retains visual tokens covering referent regions.
- It recovers context tokens for foreground-background separation.
- The approach addresses computational overhead from visual tokens.
- The paper is available on arXiv under ID 2605.13178.
- The method is text-guided.
Entities
Institutions
- arXiv