ARTFEED — Contemporary Art Intelligence

LiteLVLM: Training-Free Token Pruning for Efficient Pixel Grounding in Vision-Language Models

ai-technology · 2026-05-14

Researchers have developed LiteLVLM, a training-free, text-guided token pruning technique for efficient pixel grounding inference in large vision-language models. To reduce the computational overhead of visual tokens, the method inverts the ranking of CLIP's visual-text similarity, preserving the tokens that cover the referent regions while recovering context tokens to keep the foreground clearly separated from the background. Extensive experiments validate the method's effectiveness.
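The core selection step described above — rank visual tokens by CLIP-style visual-text similarity, invert the ranking to pick referent-region tokens, and recover a few tokens from the other end as context — could be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name, the ratio parameters, and the exact selection rule are assumptions, and random vectors stand in for real CLIP features.

```python
import numpy as np

def prune_visual_tokens(vis_feats, txt_feat, keep_ratio=0.25, ctx_ratio=0.05):
    """Hypothetical sketch of text-guided token pruning.

    vis_feats: (N, D) visual token features (stand-ins for CLIP features)
    txt_feat:  (D,)   text query feature
    Returns the sorted indices of the tokens to keep.
    """
    # Cosine similarity between each visual token and the text query.
    v = vis_feats / np.linalg.norm(vis_feats, axis=1, keepdims=True)
    t = txt_feat / np.linalg.norm(txt_feat)
    sim = v @ t  # shape (N,)

    # Ascending sort puts the inverted (low-similarity) end first.
    order = np.argsort(sim)
    n_keep = max(1, int(len(sim) * keep_ratio))
    n_ctx = max(1, int(len(sim) * ctx_ratio))

    # Inverted ranking: treat the low-similarity end as referent-region tokens,
    # and recover a small set from the high-similarity end as context tokens
    # for foreground-background separation (an assumption for this sketch).
    referent_idx = order[:n_keep]
    context_idx = order[-n_ctx:]
    return np.sort(np.concatenate([referent_idx, context_idx]))
```

Because pruning happens purely at inference time on precomputed similarities, no retraining of the vision-language model is needed — which is what makes the approach training-free.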

Key facts

  • LiteLVLM is a training-free token pruning strategy.
  • It targets pixel grounding tasks in large vision-language models.
  • The method reverses CLIP's visual-text similarity ranking.
  • It retains visual tokens covering referent regions.
  • It recovers context tokens for foreground-background separation.
  • The approach addresses computational overhead from visual tokens.
  • The research is published on arXiv with ID 2605.13178.
  • The method is text-guided.

Entities

Institutions

  • arXiv

Sources