VG-CoT Dataset Enhances Visual Reasoning in LVLMs
Researchers have introduced the Visual Grounding Chain-of-Thought (VG-CoT) dataset to improve trustworthy visual reasoning in Large Vision-Language Models (LVLMs). The dataset explicitly links each reasoning step to visual evidence within the image, addressing two limitations of existing datasets: poor scalability, since they rely on manual annotation, and weak alignment between multi-step reasoning and image regions. VG-CoT is built with a fully automated three-stage pipeline: first, object- and text-level visual evidence is extracted with state-of-the-art detection and OCR models; second, step-by-step grounded reasoning is generated with GPT-4o; third, the grounding is refined through a rationale-driven open-set detection process. A new benchmark accompanies the dataset to evaluate model trustworthiness. The work is detailed in a paper on arXiv (arXiv:2604.21396).
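The sketch below shows one way the three-stage pipeline might be wired together, based only on the description above. All function bodies, data structures, and model calls are hypothetical placeholders: the paper's actual detector, OCR model, prompts, and refinement logic are not specified here.

```python
# Hypothetical sketch of the three-stage VG-CoT construction pipeline.
# Stage bodies are stubs; a real implementation would call detection,
# OCR, and LLM APIs, none of which are named in the announcement.
from dataclasses import dataclass, field

@dataclass
class Evidence:
    label: str   # object class or recognized text
    box: tuple   # (x1, y1, x2, y2) image region
    kind: str    # "object" or "text"

@dataclass
class GroundedStep:
    rationale: str                                 # one reasoning step
    evidence: list = field(default_factory=list)   # Evidence backing the step

def extract_evidence(image_path: str) -> list:
    """Stage 1: collect object- and text-level visual evidence.
    Placeholder output standing in for detection + OCR models."""
    return [Evidence("stop sign", (40, 60, 180, 200), "object"),
            Evidence("STOP", (70, 90, 150, 130), "text")]

def generate_grounded_reasoning(question: str, evidence: list) -> list:
    """Stage 2: prompt an LLM (GPT-4o in the paper) for step-by-step
    reasoning, each step citing evidence items. Placeholder output."""
    return [GroundedStep("The sign in the boxed region reads 'STOP'.",
                         [evidence[1]]),
            GroundedStep("Therefore the vehicle must halt.",
                         [evidence[0]])]

def refine_grounding(steps: list, image_path: str) -> list:
    """Stage 3: re-ground each rationale with an open-set detector and
    keep only steps whose cited regions are confirmed. Placeholder:
    keeps any step that cites at least one evidence item."""
    return [s for s in steps if s.evidence]

if __name__ == "__main__":
    img = "example.jpg"
    evidence = extract_evidence(img)
    steps = generate_grounded_reasoning("Why must the car stop?", evidence)
    for step in refine_grounding(steps, img):
        print(step.rationale, "->", [e.label for e in step.evidence])
```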
Key facts
- VG-CoT dataset links reasoning steps to visual evidence in images.
- Dataset uses a fully automated three-stage pipeline.
- Pipeline includes detection, OCR, GPT-4o, and open-set detection.
- A new benchmark for trustworthiness is introduced.
- Paper available on arXiv: 2604.21396.
- Addresses scalability issues in existing datasets.
- Focuses on LVLMs and visual reasoning.
- Announced as an arXiv cross-listed submission.