CAVE: New Method for Fragmented Visual Reasoning in VLMs

ai-technology · 2026-05-20

To address Fragmented Visual Reasoning in vision-language models (VLMs), researchers have introduced Credit Assignment for Visual Evidence (CAVE), a structured process-reward method built on Group Relative Policy Optimization (GRPO). CAVE evaluates intermediate reasoning steps at the action level using three process signals: belief update, evidence acquisition, and adaptive focus control. The team has also introduced TRACER-Bench, a benchmark covering four nonlocal, semantically confusable reasoning dimensions, which provides key intermediate evidence for supervising reasoning paths. According to the researchers, experiments show that CAVE improves how models reason over fragmented visual evidence.

Key facts

CAVE is a structured process-reward method based on GRPO.
CAVE evaluates intermediate reasoning steps at the action level.
CAVE scores three reasoning process signals: belief update, evidence acquisition, and adaptive focus control.
TRACER-Bench covers four nonlocal and semantically confusable reasoning dimensions.
TRACER-Bench provides key intermediate evidence to supervise reasoning paths.
The work addresses Fragmented Visual Reasoning in VLMs.
VLMs struggle with integrating nonlocal visual information.
Experiments show CAVE improves visual reasoning strategies.
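
As a rough illustration of how step-level process signals could feed a GRPO-style update, the sketch below combines the three signals named above into a per-trajectory reward and then normalizes rewards within a sampled group. Only the signal names and the use of GRPO come from the summary. The `StepSignals` class, the equal weights, the `process_weight` mixing factor, and the averaging over steps are hypothetical choices made for illustration and are not taken from the CAVE paper.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class StepSignals:
    """Hypothetical per-step scores in [0, 1] for the three named signals."""
    belief_update: float
    evidence_acquisition: float
    adaptive_focus: float


def step_reward(s: StepSignals, weights=(1.0, 1.0, 1.0)) -> float:
    # Assumption: a plain weighted sum; the source does not say how signals combine.
    w_b, w_e, w_f = weights
    return (w_b * s.belief_update
            + w_e * s.evidence_acquisition
            + w_f * s.adaptive_focus)


def trajectory_reward(steps, outcome: float, process_weight: float = 0.5) -> float:
    # Assumption: final-answer reward plus the mean step reward, scaled by process_weight.
    process = mean(step_reward(s) for s in steps) if steps else 0.0
    return outcome + process_weight * process


def group_relative_advantages(rewards, eps: float = 1e-8):
    # GRPO-style normalization: each reward is compared with the mean of its
    # sampled group and divided by the group's standard deviation
    # (population standard deviation is used here as an assumption).
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Three hypothetical rollouts for the same image and question.
    group = [
        trajectory_reward([StepSignals(0.9, 0.8, 0.7)], outcome=1.0),
        trajectory_reward([StepSignals(0.2, 0.1, 0.3)], outcome=1.0),
        trajectory_reward([StepSignals(0.5, 0.5, 0.5)], outcome=0.0),
    ]
    print(group_relative_advantages(group))
```

In this sketch, two rollouts that both reach the correct answer can still receive different advantages when their intermediate steps score differently, which is the general motivation for a process reward over an outcome-only reward.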
