ARTFEED — Contemporary Art Intelligence

MMTR-Bench: Benchmarking MLLMs on Text Reconstruction from Visual Context

ai-technology · 2026-04-25

Researchers have developed MMTR-Bench, a benchmark that assesses the inherent capability of Multimodal Large Language Models (MLLMs) to reconstruct masked text from visual context. Unlike traditional question-answering formats, MMTR-Bench uses no explicit prompts, compelling models to recover masked text from single or multiple pages in real-world scenarios such as documents and webpages. This design separates the reconstruction task from instruction-following skill, allowing a focused evaluation of a model's understanding of layout, visual grounding, and knowledge integration. The benchmark comprises 2,771 test samples spanning multiple languages and target lengths, for which the researchers introduce a level-aware evaluation protocol. Experiments on representative MLLMs show that the benchmark poses a considerable challenge, particularly for sentence- and paragraph-level reconstruction. The homepage can be found at https://.

Key facts

  • MMTR-Bench evaluates MLLMs on masked text reconstruction from visual context.
  • The benchmark eliminates explicit prompts, requiring models to recover text from visual input.
  • It covers real-world domains like documents and webpages.
  • MMTR-Bench includes 2,771 test samples across multiple languages.
  • A level-aware evaluation protocol is proposed for diverse target lengths.
  • Experiments show significant challenge for sentence- and paragraph-level reconstruction.
  • The benchmark isolates reconstruction from instruction-following abilities.
  • Homepage: https://
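
The article does not detail how the level-aware protocol scores reconstructions of different lengths; as a rough illustration only, one could imagine exact match for short (word-level) targets and partial credit via string similarity for longer spans. The function name, levels, and scoring choices below are assumptions, not the paper's actual method:

```python
from difflib import SequenceMatcher

def reconstruction_score(prediction: str, reference: str, level: str) -> float:
    """Hypothetical level-aware score for a reconstructed text span.

    Assumed scheme (not from the paper): word-level targets are scored
    by exact match; sentence- and paragraph-level targets earn partial
    credit via a character-level similarity ratio in [0, 1].
    """
    pred, ref = prediction.strip(), reference.strip()
    if level == "word":
        # Short targets: all-or-nothing exact match.
        return 1.0 if pred == ref else 0.0
    # Longer targets: graded similarity between predicted and true text.
    return SequenceMatcher(None, pred, ref).ratio()
```

In such a scheme, a model that recovers most of a paragraph still earns substantial credit, while a single wrong word at word level scores zero, reflecting the different difficulty of the levels.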

Entities

Institutions

  • arXiv

Sources