LLM Judges as Human Evaluation Augmentation: A Two-Stage Sampling Approach

ai-technology · 2026-05-20

A new paper on arXiv proposes using large language models (LLMs) as auxiliary evaluators rather than substitutes for human raters in AI system assessment. The authors introduce a two-stage sampling design where LLM evaluations are collected for all observations first, and human ratings are then selectively gathered to augment the LLM data. This approach addresses the high cost and scalability issues of expert human evaluation while providing a formal statistical framework, moving beyond ad hoc agreement metrics commonly used to justify replacing human judges. The paper shifts the LLM's role from substitutive to complementary, aiming to improve efficiency and reliability in high-stakes applications such as safety and quality assessment.
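The two-stage idea can be illustrated with a minimal sketch. It assumes a simple random subsample at the second stage and uses the classic difference estimator from double sampling; the summary does not state the paper's actual selection rule or estimator, so these are stand-ins, not the authors' method. The names `two_stage_estimate`, `human_rater`, and `n_human` are hypothetical, and the data is synthetic.

```python
import random
from statistics import mean


def two_stage_estimate(llm_scores, human_rater, n_human, seed=0):
    """Difference estimator under two-stage (double) sampling.

    llm_scores  -- LLM judge scores for all N observations (stage 1)
    human_rater -- callable: index -> human score, queried only at stage 2
    n_human     -- number of observations sent to human raters

    Assumption: stage 2 is a simple random subsample without replacement.
    """
    rng = random.Random(seed)
    subsample = rng.sample(range(len(llm_scores)), n_human)
    # Mean human-minus-LLM gap on the subsample corrects the LLM's bias.
    correction = mean(human_rater(i) - llm_scores[i] for i in subsample)
    return mean(llm_scores) + correction, subsample


# Synthetic illustration: the LLM judge over-scores every item by 0.5.
truth = [i % 5 for i in range(200)]      # hypothetical human scores
llm_scores = [t + 0.5 for t in truth]    # biased LLM scores, available for all items

estimate, subsample = two_stage_estimate(llm_scores, lambda i: truth[i], n_human=20)
naive = mean(llm_scores)                 # LLM-only estimate, carries the bias
print(f"LLM-only: {naive}, two-stage: {estimate}, human ratings used: {len(subsample)}")
```

In this toy setup only 20 of the 200 items need a human rating, yet the constant bias of the LLM judge is removed. With real data the gap varies by item, so the corrected estimate carries sampling error that shrinks as the human subsample grows.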

Key facts

arXiv paper 2605.16354
LLMs used as automated evaluators of AI systems
High-stakes applications include safety and quality assessment
Expert human ratings are costly and difficult to scale
Current LLM evaluator deployment is ad hoc
Paper shifts LLM role from substitutive to auxiliary
Two-stage sampling design proposed
LLM evaluations measured for all observations at first stage
Human ratings collected at second stage
