ARTFEED — Contemporary Art Intelligence

COHERENCE Benchmark Tests MLLMs on Fine-Grained Image-Text Alignment

ai-technology · 2026-05-01

Researchers have introduced COHERENCE, a benchmark designed to evaluate Multimodal Large Language Models (MLLMs) on their ability to recover fine-grained image-text correspondences in interleaved multimodal contexts. While existing benchmarks focus on single-image or multi-image comprehension, real-world scenarios like document reading require models to identify relevant textual and visual evidence, establish alignments, and reason over interleaved contexts. COHERENCE aims to fill the gap in systematic evaluation of this fine-grained understanding ability.
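To make the task concrete, here is a minimal sketch of what an interleaved evaluation item and a fine-grained alignment metric could look like. The schema (`segments`, `gold_alignment`) and the accuracy function are illustrative assumptions for exposition, not COHERENCE's actual data format or scoring protocol.

```python
from dataclasses import dataclass

# Hypothetical sketch of an interleaved image-text evaluation item.
# Field names and structure are assumptions, not the benchmark's schema.

@dataclass
class InterleavedItem:
    segments: list        # text segments and image placeholders, in reading order
    question: str         # query requiring cross-modal evidence
    gold_alignment: dict  # maps text-segment index -> the image it describes
    answer: str

def alignment_accuracy(predicted: dict, gold: dict) -> float:
    """Fraction of gold text->image links the model recovered exactly."""
    if not gold:
        return 0.0
    correct = sum(1 for k, v in gold.items() if predicted.get(k) == v)
    return correct / len(gold)

item = InterleavedItem(
    segments=["text: The left panel shows training loss.", "image: img_0",
              "text: The right panel shows accuracy.", "image: img_1"],
    question="Which image does the second text segment describe?",
    gold_alignment={0: "img_0", 2: "img_1"},
    answer="img_1",
)

print(alignment_accuracy({0: "img_0", 2: "img_1"}, item.gold_alignment))  # 1.0
print(alignment_accuracy({0: "img_1"}, item.gold_alignment))              # 0.0
```

A metric like this rewards a model only for linking each piece of textual evidence to the correct image, which is the fine-grained correspondence the benchmark targets, rather than for answering from a single fused context.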

Key facts

  • COHERENCE is a benchmark for fine-grained image-text alignment in interleaved contexts.
  • Existing MLLM benchmarks mainly focus on single-image or multi-image comprehension.
  • Real-world scenarios like document reading require interleaved multimodal understanding.
  • MLLMs must identify relevant textual and visual evidence and establish alignments.
  • The benchmark was introduced in arXiv preprint 2604.27389.
  • It addresses the lack of systematic benchmarks for interleaved image-text contexts.
  • COHERENCE evaluates the ability to recover fine-grained correspondences.
  • The work is an arXiv preprint; its identifier (2604.27389) corresponds to an April 2026 submission.

Entities

Institutions

  • arXiv

Sources