VGR: A New MLLM for Fine-Grained Visual Reasoning
Researchers have introduced VGR (Visual Grounded Reasoning), a novel multimodal large language model (MLLM) designed to overcome the limitations of existing chain-of-thought approaches, which reason purely in language space and are largely confined to math or science domains. Instead of reasoning in language alone, VGR enhances fine-grained visual perception by first detecting image regions relevant to the problem and then producing precise answers grounded in those regions. To train VGR, the team built a large-scale supervised fine-tuning dataset, VGR-SFT, whose reasoning data combines vision grounding with language deduction. The paper is available on arXiv under identifier 2506.11991.
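To make the two-stage flow concrete, below is a minimal sketch of a detect-then-replay inference loop under stated assumptions: the model interface (`detect_regions`, `answer`) and all other names are hypothetical illustrations, not VGR's actual API.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical sketch of VGR-style inference: the model first emits
# bounding boxes for regions it deems relevant, the corresponding
# crops are re-encoded ("replayed") into the context, and the model
# then answers with fine-grained detail in view. Method names are
# illustrative assumptions, not the paper's released interface.

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

@dataclass
class GroundedAnswer:
    regions: List[Box]  # regions the reasoning was grounded on
    answer: str         # final answer conditioned on replayed crops

def grounded_reasoning(model, image, question: str) -> GroundedAnswer:
    # Stage 1: the MLLM proposes regions that may help solve the
    # problem (hypothetical method name).
    boxes: List[Box] = model.detect_regions(image, question)

    # Stage 2: "replay" each region as a crop, so detail lost at
    # full-image resolution becomes visible to the model again.
    crops = [image.crop(box) for box in boxes]  # PIL-style crop

    # Stage 3: answer conditioned on the question, the full image,
    # and the replayed region crops (hypothetical method name).
    answer: str = model.answer(question, image, crops)
    return GroundedAnswer(regions=boxes, answer=answer)
```

The design point the sketch captures is that grounding happens before answering: the answer is conditioned on explicit region evidence rather than on a language-only chain of thought.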
Key facts
- VGR is a novel reasoning multimodal large language model (MLLM) with enhanced fine-grained visual perception capabilities.
- Existing multimodal chain-of-thought approaches reason in pure language space, suffer from language bias, and are largely limited to math or science domains.
- VGR first detects relevant image regions that may help solve the problem, then provides precise answers based on those replayed image regions.
- A large-scale SFT dataset, VGR-SFT, was created; its reasoning data mixes vision grounding with language deduction (a hypothetical record layout is sketched after this list).
- The paper was announced on arXiv with identifier 2506.11991.
- The research addresses limitations in complex visual reasoning tasks that demand comprehensive understanding of image details.
- VGR differs from traditional MLLMs that answer questions or reason solely in language space.
- The approach aims to handle complex visual reasoning tasks beyond math and science domains.
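The summary does not publish the VGR-SFT schema, but a training record that mixes grounding and deduction might plausibly look like the following; every field name and value here is an assumption for illustration only.

```python
# Hypothetical VGR-SFT record layout (field names are assumptions,
# not the released schema). The key property is that the reasoning
# trace interleaves language deduction with explicit region grounding.
record = {
    "image": "example.jpg",
    "question": "What brand is on the smallest bottle?",
    "reasoning": [
        {"type": "deduction", "text": "Locate the smallest bottle first."},
        {"type": "grounding", "box": [412, 96, 498, 240]},  # (l, t, r, b)
        {"type": "deduction", "text": "Read the label inside that region."},
    ],
    "answer": "Acme",
}
```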
Entities
Institutions
- arXiv