ViDoRe v3 Benchmark Introduced for Multimodal RAG Evaluation Across Complex Scenarios
ViDoRe v3 is a newly introduced benchmark for evaluating Retrieval-Augmented Generation (RAG) systems in complex real-world contexts. This multimodal evaluation suite goes beyond plain text retrieval, covering the interpretation of visual content such as tables, charts, and images, the synthesis of information across multiple documents, and precise source grounding. It spans 10 datasets from diverse professional fields, comprising around 26,000 document pages and 3,099 human-validated queries in six languages. Backed by 12,000 hours of human annotation, it provides high-quality labels for retrieval relevance, bounding-box localization, and verified reference answers. The benchmark paper, identified as 2601.08620v2 on arXiv, highlights the shortcomings of existing benchmarks, which often overlook the complexities of multimodal content and cross-document information integration.
Key facts
- ViDoRe v3 is a comprehensive multimodal RAG benchmark
- The benchmark addresses challenges beyond simple single-document retrieval
- It includes interpretation of visual elements like tables, charts, and images
- The benchmark covers 10 datasets across diverse professional domains
- It comprises approximately 26,000 document pages paired with 3,099 human-validated queries
- Queries are available in 6 languages
- 12,000 hours of human annotation effort were invested
- The benchmark provides annotations for retrieval relevance, bounding box localization, and verified reference answers
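The paper's exact metric definitions are not given here, but annotations for retrieval relevance and bounding-box localization are typically scored with ranking and overlap metrics. As an illustrative sketch only (nDCG@k and IoU are assumptions, not metrics confirmed by the source), such an evaluation might look like:

```python
import math

def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k: compare a ranked list of page IDs against graded relevance labels."""
    gains = [relevance.get(pid, 0) for pid in ranked_ids[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) bounding boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A system that ranks the annotated relevant pages first scores nDCG@k of 1.0, and a predicted box is usually counted as a correct localization when its IoU with the annotated box exceeds a threshold such as 0.5.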
Entities
Institutions
- arXiv