ViDoRe v3 Benchmark Introduced for Multimodal RAG Evaluation Across Complex Scenarios
ViDoRe v3 is a newly introduced benchmark for evaluating Retrieval-Augmented Generation (RAG) systems in complex real-world contexts. This multimodal evaluation suite goes beyond plain text retrieval, covering the interpretation of visual content such as tables, charts, and images, the synthesis of information across multiple documents, and precise source grounding. It spans 10 datasets from diverse professional fields, comprising around 26,000 document pages and 3,099 human-validated queries in six languages. Backed by 12,000 hours of human annotation, it provides high-quality labels for retrieval relevance, bounding-box localization, and verified reference answers. The benchmark paper, identified as 2601.08620v2 on arXiv, highlights the shortcomings of existing benchmarks, which often overlook the complexities of multimodal content and cross-document information integration.
Key facts
- ViDoRe v3 is a comprehensive multimodal RAG benchmark
- The benchmark addresses challenges beyond simple single-document retrieval
- It includes interpretation of visual elements like tables, charts, and images
- The benchmark covers 10 datasets across diverse professional domains
- It comprises approximately 26,000 document pages paired with 3,099 human-validated queries
- Queries are available in 6 languages
- 12,000 hours of human annotation effort were invested
- The benchmark provides annotations for retrieval relevance, bounding box localization, and verified reference answers
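The paper's exact metric definitions are not given here, but annotations for retrieval relevance and bounding-box localization are typically scored with ranking and overlap metrics. As an illustrative sketch only (nDCG@k and IoU are assumptions, not metrics confirmed by the source), such an evaluation might look like:

```python
import math

def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k: compare a ranked list of page IDs against graded relevance labels."""
    gains = [relevance.get(pid, 0) for pid in ranked_ids[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) bounding boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A system that ranks the annotated relevant pages first scores nDCG@k of 1.0, and a predicted box is usually counted as a correct localization when its IoU with the annotated box exceeds a threshold such as 0.5.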
Entities
Institutions
- arXiv