New LLM-as-a-Judge Standard for Multi-Hop RAG Evaluation
A team of researchers has introduced a minimum measurement standard for LLM-as-a-judge comparisons of retrieval-augmented generation (RAG) systems, targeting measurement problems in multi-hop RAG evaluation. The standard fixes the top-100 candidate pool, the evidence budget, the answer cap, the generator, and the prompt. It also requires pre-registered hypotheses, cluster-aware inference, an exact cluster sign-flip check, and replication by a second judge. The researchers stress-tested the standard with the Genetic Algorithm Decoder for Multi-hop Evidence Composition (GADMEC) on 40 tasks. Their findings indicate that clustered benchmarks can overstate progress.
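One way to picture the fixed parameters is as an immutable configuration shared by every system under comparison. The following is a minimal sketch, not the authors' implementation: the class name `EvalProtocol`, the field names, the token units, and every default except the pool size of 100 are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalProtocol:
    """Settings held constant for every compared system (hypothetical schema)."""

    candidate_pool_size: int = 100       # top-100 candidate pool, as stated in the summary
    evidence_budget_tokens: int = 1024   # assumed unit and value
    answer_cap_tokens: int = 64          # assumed unit and value
    generator: str = "generator-name"    # placeholder, not a real model name
    prompt_id: str = "prompt-v1"         # placeholder identifier


def comparable(a: EvalProtocol, b: EvalProtocol) -> bool:
    """Two runs are comparable only if every fixed setting matches."""
    return a == b
```

Freezing the dataclass means a run cannot silently change its answer cap or prompt after the protocol is declared, which is the property the standard appears to be after.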
Key facts
- Proposes a minimum measurement standard for LLM-as-a-judge comparisons in RAG
- Standard fixes top-100 candidate pool, evidence budget, answer cap, generator, and prompt
- Requires pre-registered hypotheses, cluster-aware inference, exact cluster sign-flip check, and second-judge replication
- Stress-tested with Genetic Algorithm Decoder for Multi-hop Evidence Composition (GADMEC) on 40 tasks
- Clustered benchmarks can overstate progress
- Published on arXiv with ID 2605.27789
- Addresses measurement problems in multi-hop RAG
- Focuses on retrieval quality, answer length, lexical overlap, and clustered data
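The exact cluster sign-flip check named above can be illustrated with a short sketch. It assumes paired per-item score differences between two systems (for example, judge score of system A minus system B) and a cluster label per item. The function name, the choice of the summed difference as test statistic, and the two-sided p-value are assumptions, since the summary does not specify them. Signs are flipped per cluster rather than per item, and all 2^G assignments are enumerated, so this naive loop is practical only for a small number of clusters G.

```python
from collections import defaultdict
from itertools import product


def cluster_sign_flip_pvalue(diffs, clusters):
    """Two-sided exact sign-flip p-value with signs flipped per cluster."""
    # Collapse item-level differences into one sum per cluster.
    sums = defaultdict(float)
    for d, c in zip(diffs, clusters):
        sums[c] += d
    cluster_sums = list(sums.values())

    observed = abs(sum(cluster_sums))
    n_extreme = 0
    n_total = 0
    # Enumerate every +/- assignment over clusters (2**G of them).
    for signs in product((1, -1), repeat=len(cluster_sums)):
        stat = abs(sum(s * v for s, v in zip(signs, cluster_sums)))
        if stat >= observed - 1e-12:  # small tolerance for float rounding
            n_extreme += 1
        n_total += 1
    return n_extreme / n_total


diffs = [1, 1, 1, 1]
# Treating the four items as independent:
print(cluster_sign_flip_pvalue(diffs, [0, 1, 2, 3]))  # 0.125
# Treating them as two clusters of two items each:
print(cluster_sign_flip_pvalue(diffs, [0, 0, 1, 1]))  # 0.5
```

The same four differences look far more convincing when clustering is ignored (0.125 versus 0.5), which is one way clustered benchmarks can overstate progress.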