Hard Negative Captions Dataset Improves Fine-Grained Visual-Linguistic Comprehension
Researchers propose Hard Negative Captions (HNC), an automatically created dataset of foiled hard negative captions for Image-Text-Matching (ITM) training. HNC aims to give vision-language models fine-grained cross-modal comprehension, which they often lack because image-text pairs collected from the web are only weakly associated. The team also provides a challenging, manually created test set for benchmarking models on fine-grained cross-modal mismatch tasks of varying compositional complexity. Results show that training on HNC improves zero-shot detection of mismatches on diagnostic tasks and makes models more robust to noisy visual input.
Key facts
- HNC is an automatically created dataset of foiled hard negative captions.
- It is designed for Image-Text-Matching (ITM) training.
- The goal is to achieve fine-grained cross-modal comprehension in vision-language models.
- A manually created test set benchmarks models on fine-grained cross-modal mismatch tasks.
- The test set has varying levels of compositional complexity.
- Training on HNC improves zero-shot capabilities in detecting mismatches.
- Models trained on HNC remain robust under noisy visual input.
- The work targets the weak association between image and text in web-collected pairs.
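To make the idea of a foiled hard negative concrete, here is a minimal Python sketch. It is an illustration only, not the HNC generation procedure, which this summary does not describe. The helper names (`make_foil`, `itm_pairs`, `binary_cross_entropy`), the example caption, and the single-word swap are all assumptions. The sketch assumes a hard negative differs from the true caption in one detail, and that ITM is trained as binary classification of match versus mismatch.

```python
import math
import re


def make_foil(caption: str, target: str, replacement: str) -> str:
    """Return `caption` with the first whole-word `target` swapped out.

    Hypothetical helper: a one-word swap stands in for whatever
    foiling rules the real dataset uses.
    """
    pattern = r"\b" + re.escape(target) + r"\b"
    foiled, n_subs = re.subn(pattern, replacement, caption, count=1)
    if n_subs == 0:
        raise ValueError(f"{target!r} not found in caption")
    return foiled


def itm_pairs(caption: str, target: str, replacement: str):
    """Build (text, label) examples for one image: 1 = match, 0 = mismatch."""
    return [(caption, 1), (make_foil(caption, target, replacement), 0)]


def binary_cross_entropy(prob_match: float, label: int) -> float:
    """ITM loss for one pair, given the model's predicted match probability."""
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(prob_match, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


# The foil differs from the true caption in a single attribute.
pairs = itm_pairs("a brown dog sits on a red couch", "red", "blue")
for text, label in pairs:
    print(label, text)
```

Because the negative caption shares almost every word with the positive one, a model can only separate the two by checking the changed detail against the image, which is the fine-grained behavior the dataset is meant to train.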