M$^3$-VQA Benchmark Tests Multimodal AI Reasoning
Researchers have introduced M$^3$-VQA, a benchmark that tests whether multimodal large language models (MLLMs) can understand fine-grained entities and perform complex multi-hop reasoning. Unlike existing VQA datasets, which focus on broad categories or single entities, M$^3$-VQA poses multi-entity questions that require reasoning across multiple documents using both visual and textual information (for instance, comparing attributes of two pictured entities, each of which demands its own lookup). The benchmark ships with a carefully curated multimodal knowledge base and traceable supporting evidence. Evaluations of 16 leading MLLMs under three settings (no external knowledge, verified gold evidence, and retrieval-augmented input) reveal substantial difficulties, with models struggling most when given no external resources.
Key facts
- M$^3$-VQA is a knowledge-based VQA benchmark.
- It evaluates MLLMs on multimodal entity understanding and multi-hop reasoning.
- Questions involve multiple distinct entities from visual and textual sources.
- Questions require sequential and parallel multi-hop reasoning across documents.
- Includes a curated multimodal knowledge base and traceable evidence.
- 16 leading MLLMs were evaluated under three settings: no external knowledge, verified gold evidence, and retrieval-augmented input (see the sketch after this list).
- Models performed poorly without external knowledge.
- The benchmark highlights challenges in knowledge acquisition and reasoning.
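The three evaluation settings differ only in what context accompanies each question. Below is a minimal sketch of such a protocol, written as an assumption about how the comparison could be run; `query_mllm`, `retrieve`, and the dataset fields are hypothetical stand-ins, not the authors' actual harness or API.

```python
from dataclasses import dataclass

@dataclass
class Example:
    question: str             # multi-entity question text
    image_paths: list[str]    # images depicting the queried entities
    gold_evidence: list[str]  # verified, traceable supporting passages
    answer: str               # reference answer

def query_mllm(question: str, images: list[str], context: list[str]) -> str:
    """Hypothetical wrapper around an MLLM; returns the model's answer string."""
    raise NotImplementedError

def retrieve(question: str, images: list[str], kb, k: int = 5) -> list[str]:
    """Hypothetical retriever over the curated multimodal knowledge base."""
    raise NotImplementedError

def evaluate(examples: list[Example], kb, setting: str) -> float:
    """Exact-match accuracy under one of the three settings."""
    correct = 0
    for ex in examples:
        if setting == "no_knowledge":
            # Closed-book: the model must rely on its parametric knowledge.
            context = []
        elif setting == "gold_evidence":
            # Upper bound: verified evidence is handed to the model directly.
            context = ex.gold_evidence
        elif setting == "rag":
            # Retrieval-augmented: context comes from the knowledge base.
            context = retrieve(ex.question, ex.image_paths, kb)
        else:
            raise ValueError(f"unknown setting: {setting}")
        pred = query_mllm(ex.question, ex.image_paths, context)
        correct += int(pred.strip().lower() == ex.answer.strip().lower())
    return correct / len(examples)
```

Framing the settings this way makes the reported gap interpretable: the difference between the closed-book and gold-evidence scores isolates knowledge acquisition, while the gap between retrieval-augmented and gold-evidence scores isolates retrieval quality.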