ARTFEED — Contemporary Art Intelligence

Mind's Eye Benchmark Reveals Major Gaps in Multimodal AI Visual Reasoning

ai-technology · 2026-04-20

A new benchmark called "Mind's Eye" systematically evaluates multimodal large language models (MLLMs) on visual cognitive and visuospatial reasoning. Detailed in arXiv preprint 2604.16054v1, the benchmark comprises eight tasks organized under an "A-R-T" taxonomy: Abstraction, Relation, and Transformation. Drawing inspiration from classic human intelligence tests, the tasks probe core processes of fluid intelligence, including pattern induction, analogical relation mapping, and mental transformation.

The study assessed a diverse suite of closed-source and open-source MLLMs and compared their performance against human participants. Humans reached 80% accuracy, while the top-performing MLLMs scored below 50%. Error analysis identified three primary failure modes: misallocated visual attention, faulty internal perceptual manipulation, and weak abstraction of underlying visual concepts.

The findings suggest that, despite impressive progress on standard vision-language benchmarks, current MLLMs remain substantially limited in visual cognitive reasoning. The authors position Mind's Eye as a tool for future development, targeting models that can better understand and manipulate visual information in complex visuospatial tasks.
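The preprint's own evaluation harness is not reproduced here, but a minimal sketch helps make the setup concrete. The snippet below assumes a hypothetical JSONL file of multiple-choice items with category, image, prompt, and answer fields, and a model wrapped as a simple callable; the function name evaluate and the data format are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of a Mind's Eye-style scoring loop (illustrative only;
# the item format, field names, and model interface are assumptions).
import json
from collections import defaultdict
from typing import Callable, Dict

def evaluate(items_path: str, model: Callable[[str, str], str]) -> Dict[str, float]:
    """Score multiple-choice visual-reasoning items, grouped by the
    A-R-T categories (Abstraction, Relation, Transformation)."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    with open(items_path) as f:
        for line in f:
            # Hypothetical item, e.g.:
            # {"category": "Relation", "image": "item_042.png",
            #  "prompt": "Which option completes the analogy?", "answer": "B"}
            item = json.loads(line)
            prediction = model(item["image"], item["prompt"]).strip().upper()
            total[item["category"]] += 1
            if prediction == item["answer"]:
                correct[item["category"]] += 1
    scores = {cat: correct[cat] / total[cat] for cat in total}
    scores["overall"] = sum(correct.values()) / max(sum(total.values()), 1)
    return scores
```

Grouping scores by category, as sketched above, is what makes the paper's failure-mode analysis possible: an aggregate number like "below 50%" says little on its own about whether a model struggles with abstraction, relation mapping, or transformation.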

Key facts

  • The benchmark "Mind's Eye" evaluates multimodal large language models (MLLMs) on visual cognitive and visuospatial reasoning.
  • It includes eight tasks organized under an "A-R-T" taxonomy: Abstraction, Relation, and Transformation.
  • Tasks are inspired by classic human intelligence tests and probe fluid intelligence processes like pattern induction and mental transformation (a toy illustration follows this list).
  • Human participants achieved 80% accuracy on the benchmark.
  • Top-performing MLLMs scored below 50% accuracy.
  • Error analysis revealed three failure modes: misallocated visual attention, faulty internal perceptual manipulation, and weak abstraction of visual concepts.
  • The study compared both closed-source and open-source MLLMs.
  • The research is documented in arXiv preprint 2604.16054v1.
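To make the "mental transformation" idea concrete, here is a toy item in the spirit of classic mental-rotation tests. It is an illustration only, not an actual Mind's Eye task, and the 2x2 grids are invented for the example.

```python
# Toy mental-rotation item (illustrative only; not an actual Mind's Eye task).
# Question: which candidate is `pattern` rotated 90 degrees clockwise?
import numpy as np

pattern = np.array([[1, 0],
                    [1, 1]])

candidates = {
    "A": np.array([[0, 1],
                   [1, 1]]),  # 90° counterclockwise rotation (distractor)
    "B": np.array([[1, 1],
                   [1, 0]]),  # 90° clockwise rotation (correct)
    "C": np.array([[1, 1],
                   [0, 1]]),  # 180° rotation (distractor)
}

# np.rot90(..., k=-1) rotates clockwise; a solver "mentally" rotates the
# pattern and compares the result against each candidate grid.
rotated = np.rot90(pattern, k=-1)
answer = next(label for label, grid in candidates.items()
              if np.array_equal(grid, rotated))
print(answer)  # prints "B"
```

Tasks like this are trivial when the transformation can be computed explicitly; the benchmark's point is that models must perform the equivalent manipulation internally, from pixels, without such scaffolding.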

Sources

  • arXiv preprint 2604.16054v1