New LLM-as-a-Judge Standard for Multi-Hop RAG Evaluation

ai-technology · 2026-05-28

A team of researchers has proposed a minimum measurement standard for LLM-as-a-judge comparisons of retrieval-augmented generation (RAG) systems, targeting measurement problems in multi-hop RAG evaluation. The standard fixes the top-100 candidate pool, the evidence budget, the answer cap, the generator, and the prompt, so that compared systems differ only in the component under test. It also requires pre-registered hypotheses, cluster-aware inference, an exact cluster sign-flip check, and replication by a second judge. The authors stress-tested the standard with the Genetic Algorithm Decoder for Multi-hop Evidence Composition (GADMEC) on 40 tasks. Their findings indicate that clustered benchmarks can overstate progress.

Key facts

Proposes a minimum measurement standard for LLM-as-a-judge comparisons in RAG
Standard fixes top-100 candidate pool, evidence budget, answer cap, generator, and prompt
Requires pre-registered hypotheses, cluster-aware inference, exact cluster sign-flip check, and second-judge replication
Stress-tested with Genetic Algorithm Decoder for Multi-hop Evidence Composition (GADMEC) on 40 tasks
Clustered benchmarks can overstate progress
Published on arXiv with ID 2605.27789
Addresses measurement problems in multi-hop RAG
Focuses on retrieval quality, answer length, lexical overlap, and clustered data
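
To make the "exact cluster sign-flip check" concrete, here is a minimal Python sketch of one common form of the test. It is an illustration under stated assumptions, not the procedure from the paper: the function name, the `max_clusters` guard, and the toy data are all hypothetical. It assumes each benchmark item has a paired judge-score difference (system A minus system B) and a cluster label, and that under the null hypothesis the sign of each cluster's summed difference is equally likely to be positive or negative.

```python
from itertools import product


def cluster_sign_flip_pvalue(diffs, clusters, max_clusters=20):
    """Two-sided exact sign-flip p-value, flipping signs per cluster.

    diffs    -- paired score differences, one per item (hypothetical input)
    clusters -- cluster label for each item, same length as diffs
    """
    if len(diffs) != len(clusters):
        raise ValueError("diffs and clusters must have the same length")

    # Collapse items to one summed difference per cluster, so that
    # correlated items inside a cluster are flipped together.
    sums = {}
    for d, c in zip(diffs, clusters):
        sums[c] = sums.get(c, 0.0) + d
    cluster_sums = list(sums.values())

    # Full enumeration costs 2**K; refuse when K is too large.
    if len(cluster_sums) > max_clusters:
        raise ValueError("too many clusters for exact enumeration")

    observed = abs(sum(cluster_sums))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(cluster_sums)):
        stat = abs(sum(s * x for s, x in zip(signs, cluster_sums)))
        total += 1
        if stat >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / total


# Toy data: six items that all favour system A, drawn from three clusters.
diffs = [1, 1, 1, 1, 1, 1]
print(cluster_sign_flip_pvalue(diffs, [0, 0, 1, 1, 2, 2]))  # 0.25
print(cluster_sign_flip_pvalue(diffs, [0, 1, 2, 3, 4, 5]))  # 0.03125
```

In the toy data, treating the six items as independent gives p = 0.03125, while respecting the three clusters gives p = 0.25. That gap is the kind of effect behind the claim that clustered benchmarks can overstate progress. With many clusters, full enumeration becomes infeasible and a random sample of sign assignments would be needed, at which point the test is no longer exact.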
