VGR: A New MLLM for Fine-Grained Visual Reasoning
Researchers have introduced VGR (Visual Grounded Reasoning), a novel multimodal large language model (MLLM) designed to overcome the limitations of existing chain-of-thought approaches, which reason purely in language space and are largely confined to math or science domains. Instead of reasoning in language alone, VGR enhances fine-grained visual perception by first detecting image regions relevant to the problem and then producing precise answers grounded in those regions. To train VGR, the team built a large-scale supervised fine-tuning dataset, VGR-SFT, whose reasoning data combines vision grounding with language deduction. The paper is available on arXiv under identifier 2506.11991.
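To make the two-stage flow concrete, below is a minimal sketch of a detect-then-replay inference loop under stated assumptions: the model interface (`detect_regions`, `answer`) and all other names are hypothetical illustrations, not VGR's actual API.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical sketch of VGR-style inference: the model first emits
# bounding boxes for regions it deems relevant, the corresponding
# crops are re-encoded ("replayed") into the context, and the model
# then answers with fine-grained detail in view. Method names are
# illustrative assumptions, not the paper's released interface.

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

@dataclass
class GroundedAnswer:
    regions: List[Box]  # regions the reasoning was grounded on
    answer: str         # final answer conditioned on replayed crops

def grounded_reasoning(model, image, question: str) -> GroundedAnswer:
    # Stage 1: the MLLM proposes regions that may help solve the
    # problem (hypothetical method name).
    boxes: List[Box] = model.detect_regions(image, question)

    # Stage 2: "replay" each region as a crop, so detail lost at
    # full-image resolution becomes visible to the model again.
    crops = [image.crop(box) for box in boxes]  # PIL-style crop

    # Stage 3: answer conditioned on the question, the full image,
    # and the replayed region crops (hypothetical method name).
    answer: str = model.answer(question, image, crops)
    return GroundedAnswer(regions=boxes, answer=answer)
```

The design point the sketch captures is that grounding happens before answering: the answer is conditioned on explicit region evidence rather than on a language-only chain of thought.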
Key facts
- VGR is a novel reasoning multimodal large language model (MLLM) with enhanced fine-grained visual perception capabilities.
- Existing multimodal chain-of-thought approaches reason in pure language space, suffer from language bias, and are largely limited to math or science domains.
- VGR first detects relevant image regions that may help solve the problem, then provides precise answers based on those replayed image regions.
- A large-scale SFT dataset, VGR-SFT, was created; its reasoning data mixes vision grounding with language deduction (a hypothetical record layout is sketched after this list).
- The paper was announced on arXiv with identifier 2506.11991.
- The research addresses limitations in complex visual reasoning tasks that demand comprehensive understanding of image details.
- VGR differs from traditional MLLMs that answer questions or reason solely in language space.
- The approach aims to handle complex visual reasoning tasks beyond math and science domains.
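The summary does not publish the VGR-SFT schema, but a training record that mixes grounding and deduction might plausibly look like the following; every field name and value here is an assumption for illustration only.

```python
# Hypothetical VGR-SFT record layout (field names are assumptions,
# not the released schema). The key property is that the reasoning
# trace interleaves language deduction with explicit region grounding.
record = {
    "image": "example.jpg",
    "question": "What brand is on the smallest bottle?",
    "reasoning": [
        {"type": "deduction", "text": "Locate the smallest bottle first."},
        {"type": "grounding", "box": [412, 96, 498, 240]},  # (l, t, r, b)
        {"type": "deduction", "text": "Read the label inside that region."},
    ],
    "answer": "Acme",
}
```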
Entities
Institutions
- arXiv