Hard Negative Captions Dataset Improves Fine-Grained Visual-Linguistic Comprehension

ai-technology · 2026-05-09

Researchers propose Hard Negative Captions (HNC), an automatically created dataset of foiled hard negative captions for Image-Text-Matching (ITM) training. HNC targets fine-grained cross-modal comprehension in vision-language models by compensating for the weak association between images and text in web-collected pairs. The team also provides a challenging, manually created test set for benchmarking models on fine-grained cross-modal mismatch tasks of varying compositional complexity. Results show that training on HNC improves zero-shot detection of mismatches on diagnostic tasks and makes models more robust to noisy visual input.
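To make the idea of a foiled caption concrete, here is a minimal sketch. It is not the authors' generation pipeline, which this summary does not describe. It assumes a foil is made by swapping a single word in a matching caption for a plausible alternative of the same kind. The function name make_foils and the SWAPS table are hypothetical and exist only for illustration.

```python
# Hypothetical swap table: each key maps to alternatives of the same kind
# (attribute, relation, object). Not taken from the HNC paper.
SWAPS = {
    "red": ["blue", "green"],
    "left": ["right"],
    "dog": ["cat"],
}


def make_foils(caption: str) -> list[tuple[str, int]]:
    """Return ITM-style pairs for one caption.

    The original caption is labelled 1 (matches the image); every
    single-word foil is labelled 0 (mismatch).
    """
    words = caption.split()
    pairs = [(caption, 1)]
    for i, word in enumerate(words):
        for alt in SWAPS.get(word, []):
            # Replace exactly one word so the foil stays close to the original.
            foiled = words[:i] + [alt] + words[i + 1:]
            pairs.append((" ".join(foiled), 0))
    return pairs


pairs = make_foils("a red ball left of the dog")
for text, label in pairs:
    print(label, text)
```

Because each foil differs from the true caption by one word, a model cannot tell them apart from coarse image-text similarity alone. That is what makes such negatives hard.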

Key facts

HNC is an automatically created dataset of foiled hard negative captions.
It is designed for Image-Text-Matching (ITM) training.
The goal is to achieve fine-grained cross-modal comprehension in vision-language models.
A manually created test set benchmarks models on fine-grained cross-modal mismatch tasks.
The test set has varying levels of compositional complexity.
Training on HNC improves zero-shot capabilities in detecting mismatches.
Models trained on HNC perform robustly under noisy visual input scenarios.
The research addresses weak associations in web-collected image-text pairs.
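ITM is commonly framed as binary classification: given an image and a caption, the model predicts whether they match. The sketch below shows that objective as a mean binary cross-entropy over match scores. Whether the authors use exactly this loss is an assumption, and the function name itm_loss and the example scores are hypothetical.

```python
import math


def itm_loss(logits: list[float], labels: list[int]) -> float:
    """Mean binary cross-entropy over image-caption pairs.

    logits: raw match scores from a (hypothetical) vision-language model.
    labels: 1 for a matching caption, 0 for a foiled hard negative.
    """
    total = 0.0
    for z, y in zip(logits, labels):
        # Numerically stable form of -[y*log(s) + (1-y)*log(1-s)], s = sigmoid(z).
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


# One true caption scored high and one foil scored low: small loss.
good = itm_loss([5.0, -5.0], [1, 0])
# Same scores with the labels swapped: the model is confidently wrong.
bad = itm_loss([5.0, -5.0], [0, 1])
print(good < bad)
```

Under this objective, hard negatives contribute the most useful gradient, since a model that ignores fine-grained detail will score a foil almost as high as the true caption.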

Sources

arXiv cs.AI — 2026-05-09