ARTFEED — Contemporary Art Intelligence

M$^3$-VQA Benchmark Tests Multimodal AI Reasoning

ai-technology · 2026-04-30

Researchers have released M$^3$-VQA, a new benchmark that assesses multimodal large language models (MLLMs) on fine-grained entity understanding and complex multi-hop reasoning. Unlike existing VQA datasets, which focus on broad categories and single entities, M$^3$-VQA poses multi-entity questions that require reasoning across multiple documents using both visual and textual information. The benchmark ships with a carefully curated multimodal knowledge base and traceable evidence for each question. Evaluations of 16 leading MLLMs under three settings (no external knowledge, verified gold evidence, and retrieval-augmented input) reveal substantial difficulties, with models struggling most when denied external resources.
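
The three evaluation settings can be sketched as prompt-construction strategies. This is an illustrative assumption, not the authors' actual harness: the function names, the keyword-overlap retriever, and the prompt templates below are all hypothetical stand-ins for whatever pipeline M$^3$-VQA actually uses.

```python
# Hypothetical sketch of the three M^3-VQA evaluation settings.
# All names (build_prompt, retrieve) and the toy retriever are
# illustrative assumptions, not the benchmark's real code.

def retrieve(question, knowledge_base, k=2):
    """Toy keyword-overlap retriever standing in for a real
    multimodal retriever over the curated knowledge base."""
    words = set(question.lower().split())
    scored = sorted(knowledge_base,
                    key=lambda doc: -len(words & set(doc.lower().split())))
    return scored[:k]

def build_prompt(question, setting, gold_evidence=None, knowledge_base=None):
    """Assemble the model input for one of the three evaluation settings."""
    if setting == "no_knowledge":
        # Setting 1: the model must answer from internal knowledge alone.
        return question
    if setting == "gold_evidence":
        # Setting 2: verified, traceable evidence is supplied directly.
        return f"Evidence: {' '.join(gold_evidence)}\nQuestion: {question}"
    if setting == "rag":
        # Setting 3: retrieval-augmented input drawn from the knowledge base.
        retrieved = retrieve(question, knowledge_base, k=2)
        return f"Evidence: {' '.join(retrieved)}\nQuestion: {question}"
    raise ValueError(f"unknown setting: {setting}")
```

The reported result that models perform poorly in the first setting corresponds to scoring answers produced from the bare `no_knowledge` prompt against the same questions answered with gold or retrieved evidence attached.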

Key facts

  • M$^3$-VQA is a knowledge-based VQA benchmark.
  • It evaluates MLLMs on multimodal entity understanding and multi-hop reasoning.
  • Questions involve multiple distinct entities from visual and textual sources.
  • Questions require sequential and parallel multi-hop reasoning across documents.
  • Includes a curated multimodal knowledge base and traceable evidence.
  • 16 leading MLLMs were evaluated under three settings.
  • Models performed poorly without external knowledge.
  • The benchmark highlights challenges in knowledge acquisition and reasoning.
