LiteLVLM: Training-Free Token Pruning for Efficient Pixel Grounding in Vision-Language Models
LiteLVLM is a training-free, text-guided token pruning strategy for efficient pixel grounding inference in large vision-language models. To reduce the computational overhead of visual tokens, it reverses the ranking of CLIP's visual-text similarity, retaining the tokens that cover the referent region, and then recovers context tokens so that foreground and background remain clearly separated. Extensive experiments validate the method's effectiveness.
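The summary does not spell out the exact scoring or recovery rules, but a minimal sketch of this style of text-guided pruning, under the stated idea of a reversed CLIP similarity ranking plus a small set of recovered context tokens, might look as follows. The function name `prune_visual_tokens` and the parameters `keep_ratio` and `context_ratio` are illustrative assumptions, not the paper's API.

```python
import torch
import torch.nn.functional as F

def prune_visual_tokens(visual_feats, text_feat, keep_ratio=0.25, context_ratio=0.05):
    """Sketch of text-guided token pruning (hypothetical, not the paper's code).

    visual_feats:  (N, D) CLIP visual token embeddings
    text_feat:     (D,)   CLIP text embedding of the referring expression
    keep_ratio:    fraction of tokens kept from the reversed similarity ranking
    context_ratio: fraction of high-similarity tokens recovered as context
    """
    # Cosine similarity between every visual token and the text query.
    v = F.normalize(visual_feats, dim=-1)
    t = F.normalize(text_feat, dim=-1)
    sim = v @ t  # (N,)

    n = visual_feats.size(0)
    n_keep = max(1, int(n * keep_ratio))
    n_ctx = max(1, int(n * context_ratio))

    # Reversed ranking: sort by ascending similarity, so the least text-similar
    # tokens (assumed here to cover the referent region) come first.
    order = torch.argsort(sim, descending=False)
    referent_idx = order[:n_keep]

    # Recover a few of the most text-similar tokens as context, giving the
    # model a foreground-background contrast.
    context_idx = order[-n_ctx:]

    keep_idx = torch.unique(torch.cat([referent_idx, context_idx]))
    return visual_feats[keep_idx], keep_idx

# Example usage with dummy features (e.g. 24x24 patch tokens of dimension 768).
feats = torch.randn(576, 768)
text = torch.randn(768)
pruned, kept = prune_visual_tokens(feats, text)
print(pruned.shape, kept.shape)
```

The pruning ratios here are arbitrary placeholders; the actual selection criteria and token budgets would follow the paper.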
Key facts
- LiteLVLM is a training-free token pruning strategy.
- It targets pixel grounding tasks in large vision-language models.
- The method reverses CLIP's visual-text similarity ranking.
- It retains visual tokens covering referent regions.
- It recovers context tokens for foreground-background separation.
- The approach addresses computational overhead from visual tokens.
- The paper is available on arXiv under ID 2605.13178.
- The method is text-guided.
Entities
Institutions
- arXiv