CAVE: New Method for Fragmented Visual Reasoning in VLMs
To address Fragmented Visual Reasoning in vision-language models (VLMs), where a model must integrate visual evidence scattered across nonlocal regions, researchers have introduced Credit Assignment for Visual Evidence (CAVE), a structured process-reward framework built on Group Relative Policy Optimization (GRPO). CAVE evaluates intermediate reasoning steps at the action level using three signals: belief update, evidence acquisition, and adaptive focus control. The team also introduces TRACER-Bench, a benchmark covering four nonlocal and semantically confusable reasoning dimensions, which supplies key intermediate evidence for supervising reasoning paths. The reported experiments indicate that CAVE improves reasoning over fragmented visual evidence.
Key facts
- CAVE is a structured process-reward method based on GRPO.
- CAVE evaluates intermediate reasoning steps at the action level.
- CAVE uses three reasoning-process signals: belief update, evidence acquisition, and adaptive focus control.
- TRACER-Bench covers four nonlocal and semantically confusable reasoning dimensions.
- TRACER-Bench provides key intermediate evidence to supervise reasoning paths.
- The work addresses Fragmented Visual Reasoning in VLMs.
- VLMs struggle with integrating nonlocal visual information.
- Experiments show CAVE improves visual reasoning strategies.
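To make the idea of an action-level process reward on top of GRPO more concrete, here is a minimal Python sketch. It is not the authors' implementation: the summary does not say how the three signals are scored or combined, so the `StepSignals` container, the equal weights, the `process_weight` mixing factor, and the averaging over steps are all illustrative assumptions. The only part taken from standard GRPO is the last function, which normalizes each trajectory's reward against the mean and standard deviation of its sampling group.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


@dataclass
class StepSignals:
    """Hypothetical per-action scores in [0, 1]; the field names mirror
    the three signals listed above, but the scoring scale is an assumption."""
    belief_update: float
    evidence_acquisition: float
    adaptive_focus_control: float


def step_reward(
    s: StepSignals,
    weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3),
) -> float:
    """Weighted sum of the three signals for one reasoning action.
    Equal weights are an assumption, not a value from the paper."""
    w_b, w_e, w_f = weights
    return (
        w_b * s.belief_update
        + w_e * s.evidence_acquisition
        + w_f * s.adaptive_focus_control
    )


def trajectory_reward(
    steps: Sequence[StepSignals],
    outcome_reward: float,
    process_weight: float = 0.5,
) -> float:
    """Final-answer reward plus the mean step-level process reward.
    The additive mixing and process_weight are assumptions."""
    process = mean(step_reward(s) for s in steps) if steps else 0.0
    return outcome_reward + process_weight * process


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-8
) -> List[float]:
    """GRPO-style advantage: each reward minus the group mean, divided by
    the group standard deviation (population std here; implementations
    differ on this detail)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In use, one would sample several reasoning trajectories for the same image and question, score each with `trajectory_reward`, and pass the resulting list to `group_relative_advantages`; trajectories that gather and use evidence better than their group average then receive positive advantages.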
Entities
Institutions
- arXiv