CGC Framework Boosts Fine-Grained Multi-Image Understanding in MLLMs

ai-technology · 2026-04-27

Researchers propose Compositional Grounded Contrast (CGC), a low-cost framework for improving fine-grained multi-image understanding in Multimodal Large Language Models (MLLMs). CGC targets three failure modes: spatial hallucination, attention leakage, and object constancy failures. It constructs compositional multi-image training instances from existing single-image grounding annotations, using Inter-Image Contrast and Intra-Image Contrast to introduce semantically decoupled distractor contexts and correlated cross-view samples. A Rule-Based Spatial Reward within the Group Relative Policy Optimization (GRPO) framework further strengthens source-image grounding. The method requires neither expensive human annotation nor large-scale chain-of-thought data generation.
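To make the idea of composing multi-image instances from single-image grounding annotations more concrete, here is a minimal Python sketch. It is an illustration only, not the authors' pipeline: the names `GroundingSample`, `MultiImageInstance`, and `compose_instance` are hypothetical, and "semantically decoupled" is approximated by simply requiring that a distractor's annotated phrase differs from the target's, which is an assumption.

```python
import random
from dataclasses import dataclass


@dataclass
class GroundingSample:
    """One single-image grounding annotation (hypothetical schema)."""
    image_id: str
    phrase: str
    box: tuple  # (x1, y1, x2, y2) in pixels


@dataclass
class MultiImageInstance:
    """A composed multi-image training instance (hypothetical schema)."""
    image_ids: list
    question: str
    target_index: int  # position of the source image in image_ids
    target_box: tuple


def compose_instance(target, pool, num_distractors=2, rng=None):
    """Place the target image among distractor images drawn from the pool.

    Assumption: a distractor is any other image whose phrase differs
    from the target phrase. A real system would likely use a stronger
    notion of semantic decoupling.
    """
    rng = rng or random.Random()
    candidates = [
        s for s in pool
        if s.image_id != target.image_id and s.phrase != target.phrase
    ]
    if len(candidates) < num_distractors:
        raise ValueError("not enough distractor candidates in the pool")

    distractors = rng.sample(candidates, num_distractors)
    image_ids = [d.image_id for d in distractors]

    # Insert the source image at a random position so the model cannot
    # rely on a fixed slot.
    target_index = rng.randrange(num_distractors + 1)
    image_ids.insert(target_index, target.image_id)

    question = f"Which image contains '{target.phrase}', and where is it?"
    return MultiImageInstance(image_ids, question, target_index, target.box)
```

Because the source image index and box come directly from the existing annotation, the composed instance carries its own supervision signal without new human labeling, which is the low-cost property the article describes.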

Key facts

CGC stands for Compositional Grounded Contrast.
It targets fine-grained multi-image understanding in MLLMs.
It addresses spatial hallucination, attention leakage, and object constancy failures.
It uses Inter-Image Contrast and Intra-Image Contrast.
It builds on existing single-image grounding annotations.
It introduces a Rule-Based Spatial Reward within the GRPO framework.
It avoids expensive human annotation and large-scale chain-of-thought (CoT) data generation.
Published on arXiv with ID 2604.22498.
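
The article does not give the exact form of the Rule-Based Spatial Reward, so the following Python sketch is only one plausible shape for such a rule, stated under assumptions: the reward is 1.0 when the predicted source image is correct and the predicted box overlaps the annotated box with IoU at or above a threshold, and 0.0 otherwise. The function names and the 0.5 threshold are hypothetical. The last function shows the group-relative reward normalization that GRPO is generally known for, where each sampled response is scored against the mean and standard deviation of its group.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def spatial_reward(pred_index, pred_box, gt_index, gt_box, iou_threshold=0.5):
    """Hypothetical rule: reward only a correct source image AND a good box."""
    if pred_index != gt_index:
        return 0.0  # grounded in the wrong image
    return 1.0 if iou(pred_box, gt_box) >= iou_threshold else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group (GRPO-style)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

A rule of this kind needs only the index and box already present in the composed instance, so it can be computed without a learned reward model.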
