New Benchmark FATHOMS-RAG Evaluates Multimodal RAG Pipelines

ai-technology · 2026-05-25

Researchers have introduced FATHOMS-RAG, a benchmark for evaluating retrieval-augmented generation (RAG) pipelines across multiple modalities. It includes a human-created dataset of 93 questions that test how well a pipeline ingests text, tables, images, and cross-modal data. The authors also propose a phrase-level recall metric for correctness and a nearest-neighbor embedding classifier to detect hallucinations, and they report comparative evaluations of two open-source retrieval pipelines and four closed-source foundation models. A third-party human evaluation assesses alignment of outputs. Unlike existing benchmarks that focus on a single aspect such as retrieval, FATHOMS-RAG is designed to evaluate the RAG pipeline as a whole.
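The summary does not say how the phrase-level recall metric is computed. As a rough illustration only, the sketch below assumes each question comes with a list of reference phrases and scores an answer by the fraction of those phrases it contains after simple normalization. The function names, the normalization rules, and the example data are all hypothetical and are not taken from the benchmark.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def phrase_recall(answer: str, reference_phrases: list[str]) -> float:
    """Fraction of reference phrases found in the answer (assumed definition).

    Matching is done on whole words, so the phrase "201" does not match
    an answer that only contains "2019".
    """
    if not reference_phrases:
        raise ValueError("reference_phrases must not be empty")
    padded_answer = f" {normalize(answer)} "
    hits = sum(
        1
        for phrase in reference_phrases
        if f" {normalize(phrase)} " in padded_answer
    )
    return hits / len(reference_phrases)


# Made-up example: one of the two reference phrases appears in the answer.
example_answer = "Revenue rose 12 percent in 2019, driven by exports."
example_phrases = ["12 percent", "2021"]
print(phrase_recall(example_answer, example_phrases))
```

With the made-up data above, one of the two reference phrases is present, so the score is 0.5. A real implementation would likely need more careful matching (synonyms, number formats, units) than this substring check.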

Key facts

FATHOMS-RAG stands for Framework for the Assessment of Thinking and Observation in Multimodal Systems that use Retrieval Augmented Generation.
The dataset contains 93 human-created questions.
Questions evaluate ingestion of text, tables, images, and cross-modal data across one or more documents.
A phrase-level recall metric measures correctness.
A nearest-neighbor embedding classifier identifies potential hallucinations.
Two open-source retrieval pipelines and four closed-source foundation models were evaluated.
A third-party human evaluation assessed output alignment.
The benchmark is designed to evaluate entire RAG pipelines, not just retrieval.
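The key facts mention a nearest-neighbor embedding classifier for flagging potential hallucinations, without giving details. A minimal sketch of the general technique follows. It assumes a small set of previously labeled response embeddings and assigns a new response the label of its most similar neighbor by cosine similarity. The labels, the three-dimensional toy vectors, and the function names are hypothetical; a real pipeline would obtain embeddings from an embedding model, which is not specified here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbor_label(
    query: list[float],
    labeled_examples: list[tuple[list[float], str]],
) -> tuple[str, float]:
    """Return the label and similarity of the closest labeled example."""
    if not labeled_examples:
        raise ValueError("labeled_examples must not be empty")
    best_label, best_similarity = "", -math.inf
    for vector, label in labeled_examples:
        similarity = cosine_similarity(query, vector)
        if similarity > best_similarity:
            best_label, best_similarity = label, similarity
    return best_label, best_similarity


# Toy, hand-written "embeddings" standing in for real model output.
labeled = [
    ([0.9, 0.1, 0.0], "grounded"),
    ([0.8, 0.2, 0.1], "grounded"),
    ([0.1, 0.9, 0.3], "hallucination"),
    ([0.0, 0.8, 0.5], "hallucination"),
]

label, similarity = nearest_neighbor_label([0.2, 0.85, 0.2], labeled)
print(label, round(similarity, 3))
```

A single nearest neighbor keeps the sketch short. Voting over several neighbors or applying a similarity threshold are common variations, and nothing in the summary indicates which, if any, the authors use.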

Sources

arXiv cs.AI — 2026-05-25