ARTFEED — Contemporary Art Intelligence

CGC Framework Boosts Fine-Grained Multi-Image Understanding in MLLMs

ai-technology · 2026-04-27

Researchers propose Compositional Grounded Contrast (CGC), a low-cost framework for improving fine-grained multi-image understanding in Multimodal Large Language Models (MLLMs). CGC targets three failure modes: spatial hallucination, attention leakage, and object constancy failures. It constructs compositional multi-image training instances from existing single-image grounding annotations, using Inter-Image Contrast and Intra-Image Contrast to introduce semantically decoupled distractor contexts and correlated cross-view samples. A Rule-Based Spatial Reward within the Group Relative Policy Optimization (GRPO) framework further strengthens source-image grounding. The method requires neither expensive human annotation nor large-scale chain-of-thought data generation.
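
As a rough illustration of the data-construction idea, the sketch below composes one multi-image instance from single-image grounding annotations by mixing a target image with unrelated distractor images. The names (`GroundingAnnotation`, `compose_instance`), the record layout, and the use of a category mismatch as a stand-in for "semantically decoupled" are assumptions made for illustration, not details taken from the paper.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundingAnnotation:
    """One single-image grounding record (hypothetical layout)."""
    image_id: str
    phrase: str      # referring expression, e.g. "the red mug"
    category: str    # coarse label, used here only to pick unrelated distractors
    box: tuple       # (x1, y1, x2, y2) in pixels


def compose_instance(target, pool, num_distractors=2, rng=None):
    """Mix one target image with distractor images from other categories.

    Returns the shuffled image list, the query phrase, and the answer:
    the index of the source image plus the target box inside that image.
    """
    rng = rng or random.Random()
    # Assumption: "semantically decoupled" is approximated by a category mismatch.
    candidate_ids = sorted({
        a.image_id for a in pool
        if a.category != target.category and a.image_id != target.image_id
    })
    if len(candidate_ids) < num_distractors:
        raise ValueError("not enough unrelated images in the pool")
    images = [target.image_id] + rng.sample(candidate_ids, num_distractors)
    rng.shuffle(images)  # the source image must not sit at a fixed position
    return {
        "images": images,
        "query": target.phrase,
        "answer": {"image_index": images.index(target.image_id), "box": target.box},
    }
```

Shuffling the image order means the model cannot rely on position and has to identify which image the query refers to before localizing the object in it.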

Key facts

  • CGC stands for Compositional Grounded Contrast.
  • Targets fine-grained multi-image understanding in MLLMs.
  • Addresses spatial hallucination, attention leakage, and object constancy failures.
  • Uses Inter-Image Contrast and Intra-Image Contrast.
  • Builds on existing single-image grounding annotations.
  • Introduces a Rule-Based Spatial Reward within the GRPO framework.
  • Requires neither expensive human annotations nor large-scale chain-of-thought (CoT) data generation.
  • Published on arXiv with ID 2604.22498.
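
To make the reward idea concrete, here is a minimal sketch of a rule-based spatial reward paired with the group-relative advantage normalization that GRPO is known for. The specific rule (zero reward for picking the wrong source image, otherwise the box IoU), the use of the population standard deviation, and all function names are assumptions; the summary above does not spell out the actual reward rules.

```python
from statistics import mean, pstdev


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def spatial_reward(pred_index, pred_box, gold_index, gold_box):
    """Assumed rule: no credit unless the predicted source image is correct."""
    if pred_index != gold_index:
        return 0.0
    return iou(pred_box, gold_box)


def group_relative_advantages(rewards, eps=1e-6):
    """Center and scale rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because the reward is computed by fixed rules against existing box annotations, a setup like this needs no learned reward model and no additional human labels.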

Entities

Institutions

  • arXiv
