ClaimDiff-RL: Fine-Grained Caption RL via Visual Claim Comparison

ai-technology · 2026-05-22

ClaimDiff-RL is a recently introduced framework that addresses reward granularity in reinforcement learning (RL) for long-form image captioning. Existing methods score a caption as a whole, compressing local errors into a single sequence-level signal that hides mistakes in individual visual claims. ClaimDiff-RL instead uses reference-conditioned atomic claim differences as its reward units. A multimodal judge enumerates the visually grounded differences between the actor's caption and a reference caption, verifies each one against the image, assigns open-vocabulary error types and severity levels, and produces per-difference statistics for reward composition. Because the approach separates hallucinations from omissions, it allows targeted optimization of both factual accuracy and coverage.
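The per-difference bookkeeping described above can be illustrated with a minimal sketch. It is not the framework's implementation: the source does not give the reward formula, so the record fields, the severity labels and weights, and the linear penalty below are all hypothetical choices made only to show how verified differences could be aggregated separately for hallucinations and omissions.

```python
from dataclasses import dataclass

# Hypothetical record for one atomic claim difference reported by the judge.
# Field names are assumptions; the source only says each difference carries
# an error type, a severity level, and an image-verification outcome.
@dataclass(frozen=True)
class ClaimDiff:
    claim: str        # the visual claim that differs between the captions
    kind: str         # "hallucination" (unsupported claim) or "omission" (missed claim)
    error_type: str   # open-vocabulary label, e.g. "object", "attribute"
    severity: str     # open-vocabulary label, e.g. "minor", "major"
    verified: bool    # True if the judge confirmed the difference against the image

# Assumed mapping from severity label to penalty weight (not from the source).
SEVERITY_WEIGHT = {"minor": 0.25, "moderate": 0.5, "major": 1.0}
DEFAULT_WEIGHT = 0.5  # assumed fallback for severity labels outside the mapping


def diff_statistics(diffs):
    """Sum severity weights of image-verified differences, per kind."""
    stats = {"hallucination": 0.0, "omission": 0.0}
    for d in diffs:
        if d.verified:  # differences the image does not confirm are ignored
            stats[d.kind] += SEVERITY_WEIGHT.get(d.severity, DEFAULT_WEIGHT)
    return stats


def compose_reward(diffs, w_hallucination=1.0, w_omission=0.5):
    """Assumed linear composition: separate penalties for the two error kinds."""
    stats = diff_statistics(diffs)
    return -(w_hallucination * stats["hallucination"]
             + w_omission * stats["omission"])


# Illustrative input only; these differences are made up for the example.
example = [
    ClaimDiff("a red umbrella", "hallucination", "object", "major", True),
    ClaimDiff("a dog on the left", "omission", "object", "minor", True),
    ClaimDiff("the sky is overcast", "hallucination", "attribute", "moderate", False),
]

print(diff_statistics(example))
print(compose_reward(example))
```

Keeping the two totals separate is what lets the weights trade factual accuracy against coverage. A single sequence-level score would not expose that distinction.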

Key facts

ClaimDiff-RL is introduced to address the reward granularity problem in RL for long-form image captioning.
Current methods compress local errors into a single sequence-level signal.
The framework uses reference-conditioned atomic claim differences as reward units.
A multimodal judge enumerates visually grounded differences between captions.
Each difference is verified against the image.
Open-vocabulary error types and severity levels are assigned.
Per-difference statistics are produced for reward composition.
The approach separates hallucination from omission.

Entities

—

Sources

arXiv cs.AI — 2026-05-21