ARTFEED — Contemporary Art Intelligence

New LLM-as-a-Judge Standard for Multi-Hop RAG Evaluation

ai-technology · 2026-05-28

A team of researchers has proposed a minimum measurement standard for comparing retrieval-augmented generation (RAG) systems with an LLM-as-a-judge, targeting measurement problems in multi-hop RAG evaluation. The standard fixes five elements of every comparison: a top-100 candidate pool, an evidence budget, an answer cap, the generator, and the prompt. It also requires pre-registered hypotheses, cluster-aware inference, an exact cluster sign-flip check, and replication by a second judge. The standard was stress-tested with the Genetic Algorithm Decoder for Multi-hop Evidence Composition (GADMEC) on 40 tasks. The findings indicate that clustered benchmarks can overstate progress.

Key facts

  • Proposes a minimum measurement standard for LLM-as-a-judge comparisons in RAG
  • Standard fixes top-100 candidate pool, evidence budget, answer cap, generator, and prompt
  • Requires pre-registered hypotheses, cluster-aware inference, exact cluster sign-flip check, and second-judge replication
  • Stress-tested with Genetic Algorithm Decoder for Multi-hop Evidence Composition (GADMEC) on 40 tasks
  • Clustered benchmarks can overstate progress
  • Published on arXiv with ID 2605.27789
  • Addresses measurement problems in multi-hop RAG tied to retrieval quality, answer length, lexical overlap, and clustered data
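
To make the "exact cluster sign-flip check" concrete, here is a minimal sketch in Python. It assumes the check is a standard sign-flip permutation test applied to whole clusters rather than to individual items; the article does not spell out the procedure, so the function name, the input layout, and the toy scores are all hypothetical illustrations and not the authors' code.

```python
from itertools import product
from statistics import mean


def cluster_sign_flip_pvalue(diffs_by_cluster):
    """Exact two-sided sign-flip p-value computed at the cluster level.

    diffs_by_cluster (hypothetical layout) maps a cluster id to the
    per-item judge score differences (system A minus system B) that
    fall inside that cluster.
    """
    # Collapse each cluster to one mean so correlated items count once.
    cluster_means = [mean(d) for d in diffs_by_cluster.values()]
    observed = abs(sum(cluster_means))

    # Enumerate every assignment of +/- signs to whole clusters.
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(cluster_means)):
        stat = abs(sum(s * m for s, m in zip(signs, cluster_means)))
        # Small tolerance guards against floating-point ties.
        if stat >= observed - 1e-12:
            hits += 1
        total += 1
    return hits / total


# Toy data (made up): three clusters, six items, every item favours A.
toy = {"a": [1, 1], "b": [2], "c": [3, 3, 3]}
print(cluster_sign_flip_pvalue(toy))  # 0.25
```

With K clusters the smallest achievable p-value is 2 / 2^K, so the six items above cannot reach conventional significance even though every one of them favours system A. Counting the items as independent would suggest stronger evidence than three clusters can support, which is one way clustered data can overstate progress. Exhaustive enumeration only scales to a few dozen clusters; beyond that, random sampling of sign assignments is the usual substitute, at the cost of exactness.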

Entities

Institutions

  • arXiv
