WikiVQABench: Knowledge-Grounded VQA Benchmark from Wikipedia and Wikidata

ai-technology · 2026-05-22

Researchers have introduced WikiVQABench, a human-curated benchmark for knowledge-grounded Visual Question Answering (VQA). Unlike conventional VQA benchmarks, which test perception and can be answered from visual content alone, WikiVQABench requires external knowledge that cannot be inferred from the image. The benchmark is built by systematically combining Wikipedia images, their associated article captions, and structured data from Wikidata. A pipeline based on large language models (LLMs) generates candidate sets of multiple-choice questions, which human annotators then review and refine to verify factual accuracy, alignment between image and text, and the need for knowledge beyond visual cues. The result is a substantial set of Wikipedia images paired with curated multiple-choice questions for assessing knowledge-grounded reasoning.
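The generate-then-curate flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `CandidateQuestion` schema, its field names, and the `curate` helper are all hypothetical, chosen only to mirror the three checks the summary attributes to human annotators.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CandidateQuestion:
    """One LLM-generated multiple-choice item (hypothetical schema)."""
    image_title: str   # assumed identifier for the Wikipedia image
    caption: str       # caption taken from the associated article
    question: str
    options: List[str]
    answer_index: int  # position of the correct option in `options`
    # Annotator verdicts, one per criterion named in the summary.
    factually_correct: bool = False
    aligned_with_image: bool = False
    needs_external_knowledge: bool = False


def passes_curation(item: CandidateQuestion) -> bool:
    """Keep an item only if it is well formed and passed all three checks."""
    well_formed = 0 <= item.answer_index < len(item.options)
    return (
        well_formed
        and item.factually_correct
        and item.aligned_with_image
        and item.needs_external_knowledge
    )


def curate(candidates: List[CandidateQuestion]) -> List[CandidateQuestion]:
    """Filter LLM candidates down to the items annotators approved."""
    return [c for c in candidates if passes_curation(c)]
```

In this sketch a question that can be answered from the picture alone would have `needs_external_knowledge` left as `False` and would be dropped, which is the property that separates a knowledge-grounded item from a perception-only one.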

Key facts

WikiVQABench is a knowledge-grounded VQA benchmark.
It combines Wikipedia images, captions, and Wikidata knowledge.
LLMs generate candidate question-answer sets.
Human annotators curate for factual correctness and consistency.
Questions require external knowledge beyond visual evidence.
The benchmark includes multiple-choice questions.
It addresses limitations of perception-based VQA benchmarks.
The work is published on arXiv with ID 2605.21479.
