New Benchmark FATHOMS-RAG Evaluates Multimodal RAG Pipelines
Researchers have introduced FATHOMS-RAG, a benchmark for evaluating retrieval-augmented generation (RAG) pipelines across multiple modalities. It includes a human-created dataset of 93 questions that test how well a pipeline ingests text, tables, images, and cross-modal data. It also proposes a phrase-level recall metric for correctness and a nearest-neighbor embedding classifier for detecting potential hallucinations, and it reports comparative evaluations of two open-source retrieval pipelines and four closed-source foundation models. A third-party human evaluation assesses output alignment. Unlike existing benchmarks that focus on a single aspect such as retrieval, FATHOMS-RAG is designed to evaluate the entire RAG pipeline.
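The summary does not spell out how the phrase-level recall metric is computed. As a minimal sketch, assuming it is the fraction of reference key phrases that appear in a generated answer, it might look like the following. The function names, the normalization rules, and the example data are illustrative assumptions, not the authors' implementation.

```python
import re
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def phrase_recall(answer: str, reference_phrases: Sequence[str]) -> float:
    """Fraction of reference phrases found in the answer (assumed definition).

    Matching is done on whole normalized tokens, so "2021" does not match
    inside "12021".
    """
    if not reference_phrases:
        raise ValueError("at least one reference phrase is required")
    padded_answer = f" {normalize(answer)} "
    hits = sum(
        1 for phrase in reference_phrases
        if f" {normalize(phrase)} " in padded_answer
    )
    return hits / len(reference_phrases)


# Hypothetical example: two of the three reference phrases occur in the answer.
score = phrase_recall(
    "The budget was 4.2 million dollars in 2021.",
    ["4.2 million", "2021", "Q3"],
)
print(score)
```

Scoring at the phrase level rather than by exact string match lets an answer get credit for the key facts it contains even when the surrounding wording differs from the reference.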
Key facts
- FATHOMS-RAG stands for Framework for the Assessment of Thinking and Observation in Multimodal Systems that use Retrieval Augmented Generation.
- The dataset contains 93 human-created questions.
- Questions evaluate ingestion of text, tables, images, and cross-modal data across one or more documents.
- A phrase-level recall metric measures correctness.
- A nearest-neighbor embedding classifier identifies potential hallucinations.
- Two open-source retrieval pipelines and four closed-source foundation models were evaluated.
- A third-party human evaluation assessed output alignment.
- The benchmark is designed to evaluate entire RAG pipelines, not just retrieval.
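The nearest-neighbor embedding classifier listed above can be illustrated with a small sketch. It assumes each response has already been converted to an embedding vector by some embedding model and that a small labeled reference set exists. The labels, the toy two-dimensional vectors, and the use of cosine similarity are all assumptions made for illustration; the summary does not state how the authors build their reference set or which distance they use.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_neighbor_label(
    query: Sequence[float],
    labeled_examples: List[Tuple[Sequence[float], str]],
) -> str:
    """Return the label of the reference embedding most similar to the query."""
    if not labeled_examples:
        raise ValueError("labeled_examples must not be empty")
    _, best_label = max(
        labeled_examples,
        key=lambda item: cosine_similarity(query, item[0]),
    )
    return best_label


# Toy reference set with hypothetical labels; real embeddings would have
# hundreds of dimensions and come from an embedding model.
reference_set = [
    ([1.0, 0.0], "supported"),
    ([0.0, 1.0], "potential_hallucination"),
]

print(nearest_neighbor_label([0.2, 0.9], reference_set))
```

A 1-nearest-neighbor rule needs no training step, so the reference set can be extended with new labeled examples without refitting anything.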