GRIP-VLM: Reinforcement Learning for Efficient Vision-Language Model Pruning
Researchers have introduced GRIP-VLM, a framework that uses reinforcement learning to prune visual tokens in Vision-Language Models (VLMs). Traditional pruning methods rely on continuous-gradient relaxations, which often get stuck in suboptimal local minima because token pruning is inherently a discrete, non-convex combinatorial problem. GRIP-VLM instead formulates pruning as a Markov Decision Process and employs Group Relative Policy Optimization (GRPO) with a supervised warm-up phase to explore the discrete selection space directly. The framework also includes a budget-aware scorer to control compression rates. The approach aims to reduce the computational overhead of the large number of visual tokens VLMs must process, making inference more efficient without sacrificing task performance.
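To make the idea concrete, here is a minimal sketch of budget-aware token selection: a scorer assigns an importance score to each visual token, and the pruner keeps only the top-scoring fraction allowed by the budget. The function name, the use of random scores, and the token dimensions are illustrative assumptions, not details from the paper.

```python
import numpy as np

def prune_visual_tokens(tokens, scores, budget):
    """Keep the top-scoring fraction `budget` of visual tokens.

    tokens: (N, D) array of token embeddings.
    scores: (N,) importance scores from a (hypothetical) budget-aware scorer.
    budget: fraction of tokens to retain, in (0, 1].
    """
    n_keep = max(1, int(round(len(scores) * budget)))
    # Select the n_keep highest-scoring tokens, then re-sort the indices
    # so the retained tokens keep their original spatial order.
    keep_idx = np.sort(np.argsort(scores)[-n_keep:])
    return tokens[keep_idx], keep_idx

# Example: prune a 24x24 grid of 576 visual tokens down to a 25% budget.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((576, 64))
scores = rng.standard_normal(576)
pruned, kept = prune_visual_tokens(tokens, scores, budget=0.25)
print(pruned.shape)  # (144, 64)
```

Note that this greedy top-k rule is exactly the kind of hard, non-differentiable selection that motivates an RL formulation: there is no gradient through `argsort`, which is why continuous relaxations are the usual workaround.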
Key facts
- GRIP-VLM uses reinforcement learning for visual token pruning in VLMs.
- Traditional pruning methods rely on continuous-gradient relaxations.
- Token pruning is a discrete, non-convex combinatorial problem.
- GRIP-VLM formulates pruning as a Markov Decision Process.
- It employs Group Relative Policy Optimization (GRPO) with supervised warm-up.
- The framework includes a budget-aware scorer.
- The method aims to reduce computational overhead in VLMs.
- The paper is available on arXiv with ID 2605.13375.
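The GRPO component named above can be sketched in a few lines. GRPO replaces a learned value critic with a group-relative baseline: several candidate actions (here, pruning masks) are sampled for the same input, and each one's advantage is its reward normalized by the group's mean and standard deviation. The reward values below are made up for illustration.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each sampled rollout's reward
    against the mean and std of its own group (no learned critic)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Example: task rewards for four pruning masks sampled for one image.
adv = group_relative_advantages([0.9, 0.7, 0.8, 0.6])
print(np.round(adv, 3))  # [ 1.342 -0.447  0.447 -1.342]
```

Masks that beat the group average get positive advantages and are reinforced; below-average masks are pushed down, letting the policy explore the discrete selection space without a differentiable relaxation.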
Entities
Institutions
- arXiv