ARTFEED — Contemporary Art Intelligence

StratRAG Dataset Benchmarks Multi-Hop Retrieval for RAG Systems

ai-technology · 2026-04-29

StratRAG is an open-source dataset for evaluating retrieval in Retrieval-Augmented Generation (RAG) pipelines on multi-hop reasoning tasks over realistic, noisy document pools. Built on HotpotQA's distractor setting, it contains 2,200 examples across three question types: bridge, comparison, and yes-no. Each example pairs a question with a pool of 15 candidate documents: 2 gold documents and 13 topically related distractors. Three retrieval strategies were benchmarked (BM25, dense retrieval with all-MiniLM-L6-v2, and hybrid fusion) using Recall@k, MRR, and NDCG@5 on the validation set. Hybrid retrieval performed best overall (Recall@2 = 0.70, MRR = 0.93), while bridge questions proved hardest (Recall@2 = 0.67), motivating future work on reinforcement-learning-based retrieval policies. The dataset is publicly available at https://huggingface.co/datasets/Aryanp088/StratR.
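The article does not specify how the hybrid strategy fuses BM25 and dense scores; one common approach is reciprocal rank fusion (RRF), which combines ranked lists without needing to calibrate the two scoring scales. A minimal sketch, assuming RRF and using made-up document IDs:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs into a single ranking.

    Each input ranking is ordered best-first. A document's fused score is
    the sum of 1 / (k + rank) over every list it appears in, so documents
    ranked highly by multiple retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical outputs from the two retrievers over one candidate pool.
bm25_ranking = ["d3", "d7", "d1"]
dense_ranking = ["d1", "d3", "d9"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
# "d3" and "d1" appear in both lists, so both outrank the singletons.
```

The constant k = 60 is the value commonly used in the RRF literature; it damps the influence of very high ranks so no single retriever dominates.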

Key facts

  • StratRAG is an open-source retrieval evaluation dataset for RAG systems.
  • It is derived from HotpotQA's distractor setting.
  • The dataset contains 2,200 examples across bridge, comparison, and yes-no question types.
  • Each example has a pool of 15 candidate documents: 2 gold and 13 distractors.
  • Three retrieval strategies were benchmarked: BM25, dense retrieval (all-MiniLM-L6-v2), and hybrid fusion.
  • Metrics used: Recall@k, MRR, and NDCG@5 on the validation set.
  • Hybrid retrieval achieved best overall performance (Recall@2 = 0.70, MRR = 0.93).
  • Bridge questions were hardest (Recall@2 = 0.67).
  • Future work includes reinforcement-learning-based retrieval policies.
  • StratRAG is available on Hugging Face.
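The three reported metrics are standard for ranked retrieval, and with 2 gold documents per 15-document pool they are straightforward to compute. A minimal sketch with binary relevance, using hypothetical document IDs and a hypothetical retriever output:

```python
import math

def recall_at_k(ranked, gold, k):
    """Fraction of gold documents retrieved within the top k."""
    return len(set(ranked[:k]) & set(gold)) / len(gold)

def mrr(ranked, gold):
    """Reciprocal rank of the first gold document (0 if none retrieved)."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in gold:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, gold, k=5):
    """Binary-relevance NDCG: discounted gain over the top k,
    normalized by the ideal ordering (all gold docs first)."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(ranked[:k], start=1) if doc in gold)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(gold), k) + 1))
    return dcg / ideal

gold = ["g1", "g2"]                       # 2 gold docs per StratRAG example
ranked = ["g1", "d1", "g2", "d2", "d3"]   # hypothetical retriever output
r2 = recall_at_k(ranked, gold, 2)   # only g1 in the top 2 -> 0.5
rr = mrr(ranked, gold)              # g1 at rank 1 -> 1.0
n5 = ndcg_at_k(ranked, gold, 5)     # g2 at rank 3 costs some discount
```

Averaging these per-example values over the validation set yields the aggregate numbers quoted above, e.g. hybrid's Recall@2 = 0.70.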
