DiffCap-Bench: New Benchmark for Image Difference Captioning
Researchers have introduced DiffCap-Bench, a benchmark designed to advance Image Difference Captioning (IDC) and address the shortcomings of existing benchmarks. IDC generates natural language descriptions of the differences between a pair of images. It serves as a testbed for fine-grained change perception and cross-modal reasoning, and it supports the creation of image-editing data. Earlier benchmarks have been criticized for lacking diversity and compositional complexity, while conventional lexical-overlap metrics such as BLEU and METEOR neither capture semantic consistency nor account for hallucinations. DiffCap-Bench covers ten distinct difference categories to increase diversity and complexity. It also introduces an LLM-as-a-Judge evaluation protocol grounded in human-validated Difference Lists, which allows a more thorough assessment of how well models identify and describe visual changes. The paper is available on arXiv under the identifier 2605.04503.
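The weakness of lexical-overlap scoring can be illustrated with a toy example. The sketch below computes clipped unigram precision, which is one building block of BLEU and not the full BLEU or METEOR metric. The function name and all three captions are invented for illustration and are not taken from the benchmark.

```python
from collections import Counter


def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: the fraction of candidate tokens that also
    appear in the reference, counting each reference token at most as often
    as it occurs there. A simplified stand-in for lexical-overlap metrics."""
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    matched = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return matched / len(cand_tokens)


# Invented captions (assumption: not drawn from DiffCap-Bench).
reference = "the red car was removed from the street"
wrong = "the red car was added to the street"            # states the opposite change
paraphrase = "a crimson vehicle is no longer on the road"  # same meaning, new words

print(unigram_precision(wrong, reference))                  # 0.75
print(round(unigram_precision(paraphrase, reference), 2))   # 0.11
```

In this toy case the caption that states the opposite change scores far higher than a correct paraphrase, which is the kind of failure a semantics-aware evaluation is meant to avoid.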
Key facts
- DiffCap-Bench is a new benchmark for Image Difference Captioning (IDC).
- IDC generates natural language descriptions identifying differences between two images.
- Existing benchmarks lack diversity and compositional complexity.
- Standard metrics like BLEU and METEOR fail to capture semantic consistency.
- DiffCap-Bench covers ten distinct difference categories.
- The benchmark uses an LLM-as-a-Judge evaluation protocol.
- The evaluation protocol is grounded in human-validated Difference Lists.
- The work is published on arXiv with identifier 2605.04503.
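As a rough illustration of how an evaluation grounded in a Difference List might be organized, the sketch below scores a caption by the share of list items a judge considers covered. This summary does not describe the paper's actual prompts, scoring rules, or category names, so everything here is an assumption: `DifferenceItem`, `coverage`, the category labels, and especially `keyword_judge`, which is a trivial keyword matcher standing in for a real LLM call.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DifferenceItem:
    """One human-validated difference (hypothetical structure)."""
    category: str      # hypothetical label; the ten real categories are not named here
    description: str   # short statement of the change


# A judge decides whether a caption covers one difference item.
Judge = Callable[[str, str], bool]


def keyword_judge(caption: str, description: str) -> bool:
    """Placeholder judge (assumption): true if every word of the description
    appears in the caption. A real protocol would query an LLM instead."""
    caption_words = set(re.findall(r"[a-z]+", caption.lower()))
    needed = set(re.findall(r"[a-z]+", description.lower()))
    return needed <= caption_words


def coverage(caption: str, items: List[DifferenceItem], judge: Judge) -> float:
    """Fraction of difference-list items the judge marks as covered."""
    if not items:
        return 0.0
    hits = sum(judge(caption, item.description) for item in items)
    return hits / len(items)


# Invented example data (not from the benchmark).
difference_list = [
    DifferenceItem("object removal", "car removed"),
    DifferenceItem("color change", "door blue"),
]
caption = "The car was removed from the driveway."
print(coverage(caption, difference_list, keyword_judge))  # 0.5
```

Passing the judge as a callable keeps the scoring loop independent of any particular model, so the placeholder can be swapped for an LLM-backed function without changing `coverage`.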
Entities
Institutions
- arXiv