Visually-Guided Policy Optimization Enhances VLM Reasoning
Visually-Guided Policy Optimization (VGPO) is a new framework that addresses visual faithfulness deficiencies in vision-language models (VLMs) trained with reinforcement learning with verifiable rewards (RLVR). The authors identify two key issues: attention to visual tokens is only sparsely activated, and visual information is progressively forgotten across reasoning steps (temporal visual forgetting). To counter both, VGPO introduces a Visual Attention Compensation mechanism that uses visual similarity to amplify visual cues and progressively raises the expected level of visual attention in later steps. VGPO also applies a dual-grained advantage re-weighting strategy across the steps within each trajectory. The paper is available on arXiv under the identifier 2604.09349.
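To make the two ideas above concrete, the following is a toy sketch, not the paper's formulation. It assumes that each reasoning step exposes one row of attention weights, that the positions of the visual tokens are known, and that the "expected" visual share rises linearly with the step index. The function names (`visual_attention_share`, `rising_targets`, `reweight_advantage`) and the parameters `base`, `growth`, and `strength` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def visual_attention_share(attn_rows: Sequence[Sequence[float]],
                           visual_idx: Sequence[int]) -> List[float]:
    """Fraction of each step's attention mass that falls on visual tokens."""
    shares = []
    for row in attn_rows:
        total = sum(row)
        visual = sum(row[i] for i in visual_idx)
        shares.append(visual / total if total > 0 else 0.0)
    return shares


def rising_targets(num_steps: int, base: float = 0.2,
                   growth: float = 0.05) -> List[float]:
    """Assumed schedule: the expected visual share grows linearly per step."""
    return [base + growth * t for t in range(num_steps)]


def reweight_advantage(advantage: float,
                       shares: Sequence[float],
                       targets: Sequence[float],
                       strength: float = 1.0) -> List[float]:
    """Spread one trajectory-level advantage over steps.

    Steps whose visual share exceeds the target get a larger weight, steps
    below it a smaller one. Weights are clipped at zero and normalised so
    the mean per-step advantage equals the original advantage.
    """
    weights = [max(1.0 + strength * (s - t), 0.0)
               for s, t in zip(shares, targets)]
    mean_w = sum(weights) / len(weights) if weights else 0.0
    if mean_w == 0.0:
        return [0.0 for _ in weights]
    return [advantage * w / mean_w for w in weights]


if __name__ == "__main__":
    # Two steps, four tokens each; tokens 0 and 1 are the visual tokens.
    attn = [[0.5, 0.5, 0.0, 0.0],   # early step: attends mostly to the image
            [0.1, 0.1, 0.4, 0.4]]   # later step: visual attention has faded
    shares = visual_attention_share(attn, visual_idx=[0, 1])
    targets = rising_targets(len(shares))
    print(shares)
    print(reweight_advantage(1.0, shares, targets))
```

In this sketch, a falling visual share in later rows is the toy analogue of temporal visual forgetting, and the rising target schedule penalises it more strongly the later it occurs. How VGPO actually measures visual similarity and combines its two granularities is not specified in this summary.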
Key facts
- VGPO stands for Visually-Guided Policy Optimization
- RLVR is reinforcement learning with verifiable rewards
- VLMs are vision-language models
- The Visual Attention Compensation mechanism uses visual similarity to amplify visual cues
- Dual-grained advantage re-weighting is applied across the steps within a trajectory
- Paper ID: arXiv:2604.09349
- Announce type: replace-cross
- Empirical analysis reveals temporal visual forgetting
Entities
Institutions
- arXiv