ARTFEED — Contemporary Art Intelligence

Hard Negative Captions Dataset Improves Fine-Grained Visual-Linguistic Comprehension

ai-technology · 2026-05-09

Researchers propose Hard Negative Captions (HNC), an automatically created dataset of foiled hard negative captions for Image-Text-Matching (ITM) training. A foiled caption closely resembles a correct one but contains a detail that does not match the image. HNC targets fine-grained cross-modal comprehension in vision-language models, which is hampered by the weak association between image and text in web-collected pairs. The team also provides a challenging, manually created test set for benchmarking models on fine-grained cross-modal mismatch detection at varying levels of compositional complexity. According to the reported results, training on HNC improves zero-shot detection of mismatches on diagnostic tasks and makes models more robust to noisy visual input.
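To make the idea of a foiled hard negative concrete, here is a minimal Python sketch of how ITM training pairs with one hard negative per caption could be assembled. It is an illustration only, not the HNC authors' generation pipeline: the word-swap table `FOIL_SWAPS`, the `ItmExample` record, and the functions `make_foil` and `build_itm_pairs` are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical swap table (an assumption for illustration): each word maps to
# a plausible substitute that would make the caption contradict the image.
FOIL_SWAPS = {
    "red": "blue",
    "left": "right",
    "dog": "cat",
}


@dataclass(frozen=True)
class ItmExample:
    image_id: str
    caption: str
    label: int  # 1 = caption matches the image, 0 = mismatch (foil)


def make_foil(caption: str) -> Optional[str]:
    """Replace the first swappable word; return None if nothing can be swapped."""
    words = caption.split()
    for i, word in enumerate(words):
        if word in FOIL_SWAPS:
            return " ".join(words[:i] + [FOIL_SWAPS[word]] + words[i + 1:])
    return None


def build_itm_pairs(image_id: str, caption: str) -> List[ItmExample]:
    """Return the positive pair plus one hard negative, when a foil exists."""
    examples = [ItmExample(image_id, caption, 1)]
    foil = make_foil(caption)
    if foil is not None:
        examples.append(ItmExample(image_id, foil, 0))
    return examples


pairs = build_itm_pairs("img_001", "a red ball to the left of a dog")
for example in pairs:
    print(example.label, example.caption)
```

Because the foil differs from the correct caption by a single word, a model cannot separate the two from coarse topic overlap alone and has to check the specific attribute against the image. That is the sense in which such negatives are "hard".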

Key facts

  • HNC is an automatically created dataset of foiled hard negative captions.
  • It is designed for Image-Text-Matching (ITM) training.
  • The goal is to achieve fine-grained cross-modal comprehension in vision-language models.
  • A manually created test set benchmarks models on fine-grained cross-modal mismatch tasks.
  • The test set has varying levels of compositional complexity.
  • Training on HNC improves zero-shot capabilities in detecting mismatches.
  • Models trained on HNC perform robustly under noisy visual input scenarios.
  • The research addresses weak associations in web-collected image-text pairs.
