CGC Framework Boosts Fine-Grained Multi-Image Understanding in MLLMs
Researchers propose Compositional Grounded Contrast (CGC), a low-cost framework for improving fine-grained multi-image understanding in Multimodal Large Language Models (MLLMs). CGC addresses spatial hallucination, attention leakage, and object constancy failures by constructing compositional multi-image training instances from existing single-image grounding annotations. It uses Inter-Image Contrast and Intra-Image Contrast to introduce semantically decoupled distractor contexts and correlated cross-view samples. A Rule-Based Spatial Reward within the Group Relative Policy Optimization (GRPO) framework further strengthens source-image grounding. The method requires neither expensive human annotation nor large-scale chain-of-thought (CoT) data generation.
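To make the construction step concrete, here is a minimal sketch of how a multi-image training instance might be assembled from single-image grounding annotations. The summary does not describe the actual pipeline, so everything here is an assumption for illustration: the `GroundedSample` record, the `compose_instance` helper, the rule that distractors come from other images with a different phrase, and the output fields. It covers only the distractor idea behind Inter-Image Contrast; the correlated cross-view samples of Intra-Image Contrast are not modeled.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundedSample:
    """Hypothetical single-image grounding annotation."""
    image_id: str
    phrase: str   # referring expression for one object
    box: tuple    # (x1, y1, x2, y2) in pixel coordinates


def compose_instance(target, pool, num_distractors=2, seed=None):
    """Build one multi-image instance around `target` (illustrative only).

    Distractors are drawn from other images whose phrase differs from the
    target's, as a crude stand-in for "semantically decoupled" contexts.
    """
    rng = random.Random(seed)

    # Keep at most one candidate per distractor image.
    by_image = {}
    for sample in pool:
        if sample.image_id != target.image_id and sample.phrase != target.phrase:
            by_image.setdefault(sample.image_id, sample)

    candidates = list(by_image.values())
    if len(candidates) < num_distractors:
        raise ValueError("not enough distractor images in the pool")

    distractors = rng.sample(candidates, num_distractors)

    # Shuffle so the source image does not always sit at index 0.
    images = [target.image_id] + [d.image_id for d in distractors]
    rng.shuffle(images)

    return {
        "images": images,
        "query": target.phrase,
        "answer_image_index": images.index(target.image_id),
        "answer_box": target.box,
    }
```

Calling `compose_instance(sample, pool, num_distractors=2)` on any annotation yields a three-image instance whose answer records both which image holds the object and where it is, which is the supervision a source-image grounding objective would need.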
Key facts
- CGC stands for Compositional Grounded Contrast.
- Targets fine-grained multi-image understanding in MLLMs.
- Addresses spatial hallucination, attention leakage, and object constancy failures.
- Uses Inter-Image Contrast and Intra-Image Contrast.
- Builds on existing single-image grounding annotations.
- Introduces a Rule-Based Spatial Reward within the GRPO framework.
- Avoids expensive human annotations or large-scale CoT data generation.
- Published on arXiv with ID 2604.22498.
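The Rule-Based Spatial Reward listed above is not specified in this summary, so the following is only a hedged sketch of what such a reward could look like. It assumes (this is not confirmed by the source) that a response earns reward only when it names the correct source image and its predicted box overlaps the annotated box by at least an IoU threshold. The function names, the 0.5 threshold, and the binary reward are all hypothetical. The last helper shows the group-relative normalization that GRPO applies to the rewards of responses sampled for the same prompt.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def spatial_reward(pred_index, pred_box, gold_index, gold_box, iou_threshold=0.5):
    """Hypothetical rule: reward 1.0 only for the right image AND a close box."""
    if pred_index != gold_index:
        return 0.0  # grounded in the wrong image
    return 1.0 if iou(pred_box, gold_box) >= iou_threshold else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = variance ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the reward is computed by fixed rules from existing box annotations, it needs no learned reward model and no extra human labels, which is consistent with the low-cost framing of the method.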
Entities
Institutions
- arXiv