VCap: Hypergeometric Rewards Improve Visual Captioning
A new reinforcement learning method for visual captioning, VCap, uses a Witness-Adjudicator reward to improve factual accuracy. The approach pairs a reference caption (witness) with visual signals (adjudicator) to verify factual consistency, achieving hypergeometric-distribution-level precision. This enables effective learning even from imperfect references, addressing limitations of existing reward designs that lack fine-grained factual verification. The method targets omission and hallucination in multimodal large language models (MLLMs).
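The summary does not give the reward's formula, so the following is only a minimal sketch of how a witness-plus-adjudicator check could be turned into a scalar reward. Everything in it is an assumption made for illustration: the function name `witness_adjudicator_reward`, the idea of captions as sets of atomic claims, the `adjudicate` callback standing in for the visual signal, and the F1-style combination. It is not VCap's actual implementation. The witness (reference) proposes facts to look for, the adjudicator filters out reference facts it cannot confirm, and the candidate is penalized both for unsupported claims (hallucination) and for missing confirmed facts (omission).

```python
from typing import Callable, Iterable, Set


def witness_adjudicator_reward(
    candidate_claims: Iterable[str],
    reference_claims: Iterable[str],
    adjudicate: Callable[[str], bool],
) -> float:
    """Hypothetical F1-style reward over atomic claims (illustration only)."""
    candidate: Set[str] = set(candidate_claims)
    reference: Set[str] = set(reference_claims)
    if not candidate:
        return 0.0

    # The witness proposes facts; only those the adjudicator confirms are
    # trusted, so errors in an imperfect reference are not rewarded.
    verified_reference = {r for r in reference if adjudicate(r)}

    # Candidate claims the adjudicator rejects count as hallucinations.
    supported = {c for c in candidate if adjudicate(c)}
    precision = len(supported) / len(candidate)

    # Verified reference facts missing from the candidate count as omissions.
    if verified_reference:
        recall = len(candidate & verified_reference) / len(verified_reference)
    else:
        recall = 1.0

    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy adjudicator: a fixed set of facts assumed to be visible in the image.
visible = {"a dog", "a red ball", "grass"}
reward = witness_adjudicator_reward(
    candidate_claims=["a dog", "a frisbee"],              # one hallucination
    reference_claims=["a dog", "a red ball", "a cat"],    # "a cat" is a reference error
    adjudicate=lambda claim: claim in visible,
)
print(reward)
```

In this toy setup, exact string matching stands in for claim comparison, and the adjudicator is a lookup table. A real system would need a claim extractor and a visual verification model, neither of which is specified in the summary.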
Key facts
- VCap is a reinforcement learning method for visual captioning built on a Witness-Adjudicator reward.
- It pairs a reference caption (witness) with a visual signal (adjudicator).
- The reward signal has hypergeometric-distribution-level precision.
- Addresses omission and hallucination in MLLMs.
- Enables learning from imperfect references.
- Existing reward designs lack fine-grained factual verification.
- Published on arXiv with ID 2605.28023.
- The arXiv announcement type is "cross" (a cross-listing).
Entities
Institutions
- arXiv