ARTFEED — Contemporary Art Intelligence

New Benchmarks Reveal Cross-Modal Inconsistency in MLLMs

ai-technology · 2026-04-24

Researchers introduced REST and REST+, two benchmarks for evaluating cross-modal inconsistency in multimodal large language models (MLLMs). The benchmarks contain samples that carry identical semantic information across image, text, and mixed modalities. Evaluating 15 state-of-the-art MLLMs, the study found substantial variation in modality inconsistency across models, even after accounting for OCR errors. Neither rendering text as an image nor transcribing an image as text resolved the inconsistency. Visual characteristics such as text color and resolution affected performance, as did the number of vision tokens, while font choice did not.
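
To make the evaluation setup concrete, here is a minimal sketch of a cross-modal consistency probe: the same question is posed once as plain text and once as an image, and the probe counts how often the two answers diverge. The Sample fields, the ask_text/ask_image callables, and the demo answers are illustrative assumptions, not the REST/REST+ harness itself.

    # Minimal sketch of a cross-modal consistency probe (assumption:
    # `ask_text` and `ask_image` stand in for the MLLM under test; this
    # is not the benchmarks' actual evaluation harness).
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Sample:
        question: str       # question text
        image_path: str     # the same question rendered as an image
        choices: list[str]  # answer options

    def inconsistency_rate(
        ask_text: Callable[[str], str],
        ask_image: Callable[[str, str], str],
        samples: list[Sample],
    ) -> float:
        """Fraction of samples where the text-modality answer and the
        image-modality answer disagree, regardless of correctness."""
        disagree = 0
        for s in samples:
            prompt = f"{s.question}\nOptions: {', '.join(s.choices)}"
            a_text = ask_text(prompt)
            a_image = ask_image(s.image_path, "Answer the question shown in the image.")
            if a_text.strip().lower() != a_image.strip().lower():
                disagree += 1
        return disagree / len(samples)

    # Demo with dummy models: the image pathway "misreads" the second sample.
    if __name__ == "__main__":
        samples = [
            Sample("2 + 2 = ?", "q1.png", ["3", "4"]),
            Sample("Capital of France?", "q2.png", ["Paris", "Rome"]),
        ]
        text_model = lambda p: "4" if "2 + 2" in p else "Paris"
        image_model = lambda path, p: "4" if path == "q1.png" else "Rome"
        print(inconsistency_rate(text_model, image_model, samples))  # prints 0.5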

Key facts

  • REST and REST+ are new benchmarks for cross-modal inconsistency.
  • Benchmarks include samples with the same semantic information in image, text, and mixed modalities.
  • 15 state-of-the-art MLLMs were evaluated.
  • Modality inconsistency varies substantially among models.
  • OCR errors do not fully explain inconsistency.
  • Rendering text as image or image as text does not solve inconsistency.
  • Text color and resolution impact performance; font does not (see the rendering sketch after this list).
  • Number of vision tokens affects model performance.
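
For intuition on how text-to-image probes with varying visual characteristics might be produced, here is a small Pillow-based sketch that renders the same question at different colors and resolutions. The function name, canvas dimensions, and file names are illustrative assumptions; the study's own rendering pipeline is not described here.

    # Illustrative sketch: render the same question as images that differ in
    # color and resolution, the visual factors the study found to matter.
    # Requires Pillow; file names and dimensions are arbitrary example values.
    from PIL import Image, ImageDraw

    def render_text(text: str, color: str = "black",
                    size: tuple[int, int] = (640, 80),
                    path: str = "probe.png") -> None:
        """Draw `text` on a white canvas, then resize to vary resolution."""
        img = Image.new("RGB", (640, 80), "white")
        draw = ImageDraw.Draw(img)
        draw.text((10, 10), text, fill=color)  # default bitmap font
        img = img.resize(size)  # down- or upscaling simulates resolution changes
        img.save(path)

    # Same semantic content, different visual characteristics:
    render_text("What is 2 + 2?", color="black", size=(640, 80), path="q_base.png")
    render_text("What is 2 + 2?", color="gray", size=(160, 20), path="q_lowres_gray.png")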
