ARTFEED — Contemporary Art Intelligence

VCap: Hypergeometric Rewards Improve Visual Captioning

other · 2026-05-28

VCap is a new reinforcement learning method for visual captioning that uses a Witness-Adjudicator reward to improve factual accuracy. The reward pairs a reference caption (the witness) with visual signals (the adjudicator) to verify factual consistency, and is reported to reach hypergeometric-distribution-level precision. This verification enables effective learning even from imperfect references, a case that existing reward designs handle poorly because they lack fine-grained factual verification. The method targets omission and hallucination in multimodal large language models (MLLMs).
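For readers unfamiliar with the term, the hypergeometric distribution gives the probability of drawing exactly k "successes" in n draws without replacement from a finite population. The summary does not say how VCap uses this distribution, so the sketch below only illustrates the distribution itself; the function name `hypergeom_pmf` and the fact-checking framing in the comments are assumptions made for illustration, not the paper's reward.

```python
from math import comb


def hypergeom_pmf(k: int, population: int, successes: int, draws: int) -> float:
    """P(X = k): exactly k successes in `draws` draws without replacement.

    Illustrative framing (an assumption, not from the paper): `population`
    could be the facts in a caption, `successes` the facts that are actually
    correct, and `draws` the number of facts sampled for checking.
    """
    if not (0 <= successes <= population and 0 <= draws <= population):
        raise ValueError("need 0 <= successes, draws <= population")
    # Outside the support the probability is zero.
    lowest = max(0, draws - (population - successes))
    highest = min(draws, successes)
    if k < lowest or k > highest:
        return 0.0
    return (
        comb(successes, k)
        * comb(population - successes, draws - k)
        / comb(population, draws)
    )


# 10 facts, 4 of them correct, 3 sampled: chance that exactly 2 are correct.
p = hypergeom_pmf(2, population=10, successes=4, draws=3)
```

Here `p` works out to C(4,2)·C(6,1)/C(10,3) = 36/120 = 0.3.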

Key facts

  • VCap is a reinforcement learning method for visual captioning built on a Witness-Adjudicator reward.
  • It pairs a reference caption (witness) with a visual signal (adjudicator).
  • The reward signal has hypergeometric-distribution-level precision.
  • Addresses omission and hallucination in MLLMs.
  • Enables learning from imperfect references.
  • Existing reward designs lack fine-grained factual verification.
  • Published on arXiv with ID 2605.28023.
  • The arXiv announcement type is "cross" (a cross-listing).
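The witness/adjudicator pairing described above can be pictured with a toy reward. This is a minimal sketch under strong assumptions: captions are reduced to sets of fact strings, the adjudicator is a plain callable standing in for a visual check, and the scoring rule (confirmed recall minus unsupported rate) is invented for illustration. None of `toy_reward`, `adjudicator`, or the scoring rule comes from the paper.

```python
from typing import Callable, Set


def toy_reward(
    candidate_facts: Set[str],
    witness_facts: Set[str],
    adjudicator: Callable[[str], bool],
) -> float:
    """Hypothetical reward: not VCap's actual formulation.

    The witness (reference caption) proposes facts; the adjudicator
    (visual check) decides which facts are really supported by the image.
    """
    # Keep only reference facts that the visual check confirms, so errors
    # in an imperfect reference do not count against the candidate.
    confirmed = {f for f in witness_facts if adjudicator(f)}
    # Omission: confirmed facts the candidate failed to mention.
    recall = len(candidate_facts & confirmed) / len(confirmed) if confirmed else 0.0
    # Hallucination: candidate facts the visual check rejects.
    unsupported = {f for f in candidate_facts if not adjudicator(f)}
    rate = len(unsupported) / len(candidate_facts) if candidate_facts else 0.0
    return recall - rate


# Toy image content and an imperfect reference ("frisbee" is wrong).
visible = {"dog", "ball", "grass"}
reference = {"dog", "ball", "frisbee"}
check = lambda fact: fact in visible

complete = toy_reward({"dog", "ball"}, reference, check)  # no omission, no hallucination
flawed = toy_reward({"dog", "cat"}, reference, check)     # omits "ball", hallucinates "cat"
```

In this toy setup `complete` is 1.0 and `flawed` is 0.0, and a candidate that copies the reference's wrong "frisbee" is penalized rather than rewarded.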

Entities

Institutions

  • arXiv
