M$^3$-VQA Benchmark Tests Multimodal AI Reasoning
Researchers have introduced M$^3$-VQA, a benchmark that tests whether multimodal large language models (MLLMs) can understand fine-grained entities and perform complex multi-hop reasoning. Unlike existing VQA datasets, which focus on broad categories or single entities, M$^3$-VQA poses multi-entity questions that require reasoning across multiple documents using both visual and textual information (for instance, comparing attributes of two pictured entities, each of which demands its own lookup). The benchmark ships with a carefully curated multimodal knowledge base and traceable supporting evidence. Evaluations of 16 leading MLLMs under three settings (no external knowledge, verified gold evidence, and retrieval-augmented input) reveal substantial difficulties, with models struggling most when given no external resources.
Key facts
- M$^3$-VQA is a knowledge-based VQA benchmark.
- It evaluates MLLMs on multimodal entity understanding and multi-hop reasoning.
- Questions involve multiple distinct entities from visual and textual sources.
- Questions require sequential and parallel multi-hop reasoning across documents.
- Includes a curated multimodal knowledge base and traceable evidence.
- 16 leading MLLMs were evaluated under three settings: no external knowledge, verified gold evidence, and retrieval-augmented input (see the sketch after this list).
- Models performed poorly without external knowledge.
- The benchmark highlights challenges in knowledge acquisition and reasoning.
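The three evaluation settings differ only in what context accompanies each question. Below is a minimal sketch of such a protocol, written as an assumption about how the comparison could be run; `query_mllm`, `retrieve`, and the dataset fields are hypothetical stand-ins, not the authors' actual harness or API.

```python
from dataclasses import dataclass

@dataclass
class Example:
    question: str             # multi-entity question text
    image_paths: list[str]    # images depicting the queried entities
    gold_evidence: list[str]  # verified, traceable supporting passages
    answer: str               # reference answer

def query_mllm(question: str, images: list[str], context: list[str]) -> str:
    """Hypothetical wrapper around an MLLM; returns the model's answer string."""
    raise NotImplementedError

def retrieve(question: str, images: list[str], kb, k: int = 5) -> list[str]:
    """Hypothetical retriever over the curated multimodal knowledge base."""
    raise NotImplementedError

def evaluate(examples: list[Example], kb, setting: str) -> float:
    """Exact-match accuracy under one of the three settings."""
    correct = 0
    for ex in examples:
        if setting == "no_knowledge":
            # Closed-book: the model must rely on its parametric knowledge.
            context = []
        elif setting == "gold_evidence":
            # Upper bound: verified evidence is handed to the model directly.
            context = ex.gold_evidence
        elif setting == "rag":
            # Retrieval-augmented: context comes from the knowledge base.
            context = retrieve(ex.question, ex.image_paths, kb)
        else:
            raise ValueError(f"unknown setting: {setting}")
        pred = query_mllm(ex.question, ex.image_paths, context)
        correct += int(pred.strip().lower() == ex.answer.strip().lower())
    return correct / len(examples)
```

Framing the settings this way makes the reported gap interpretable: the difference between the closed-book and gold-evidence scores isolates knowledge acquisition, while the gap between retrieval-augmented and gold-evidence scores isolates retrieval quality.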