ARTFEED — Contemporary Art Intelligence

LLM Judges as Human Evaluation Augmentation: A Two-Stage Sampling Approach

ai-technology · 2026-05-20

A new arXiv paper (2605.16354) proposes using large language models (LLMs) as auxiliary evaluators of AI systems rather than as substitutes for human raters. The authors introduce a two-stage sampling design: LLM evaluations are collected for every observation in the first stage, and human ratings are then selectively gathered in the second stage to augment the LLM data. The design targets the high cost and poor scalability of expert human evaluation, and it places LLM judging within a formal statistical framework instead of relying on the ad hoc agreement metrics commonly used to justify replacing human judges. The aim is more efficient and reliable evaluation in high-stakes applications such as safety and quality assessment.

Key facts

  • arXiv paper 2605.16354
  • LLMs used as automated evaluators of AI systems
  • High-stakes applications include safety and quality assessment
  • Expert human ratings are costly and difficult to scale
  • Current LLM evaluator deployment is ad hoc
  • Paper shifts LLM role from substitutive to auxiliary
  • Two-stage sampling design proposed
  • LLM evaluations collected for all observations in the first stage
  • Human ratings selectively collected in the second stage
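
The two-stage design listed above can be illustrated with a small Python sketch. The summary does not state the paper's actual estimator or how it selects observations for human rating, so this example swaps in a classical difference estimator from double sampling with a simple random second-stage sample. The names `two_stage_estimate`, `human_rater`, and `n_human` are hypothetical and chosen only for illustration.

```python
import random
import statistics


def two_stage_estimate(llm_scores, human_rater, n_human, seed=0):
    """Estimate the mean human rating from LLM scores on every item
    plus human ratings on a random subsample.

    Assumption: this is a textbook difference estimator, not the
    paper's method, which the summary does not describe.
    """
    n_total = len(llm_scores)
    if not 2 <= n_human <= n_total:
        raise ValueError("n_human must be between 2 and len(llm_scores)")

    rng = random.Random(seed)

    # Stage 1: an LLM score is already available for every observation.
    llm_mean = statistics.fmean(llm_scores)

    # Stage 2: simple random sample without replacement, rated by humans.
    sampled = rng.sample(range(n_total), n_human)
    human = [human_rater(i) for i in sampled]

    # Average human-minus-LLM gap on the rated subsample.
    gaps = [h - llm_scores[i] for h, i in zip(human, sampled)]
    estimate = llm_mean + statistics.fmean(gaps)

    # Standard error with a finite-population correction, treating the
    # stage-1 observations as the population of interest.
    fpc = 1 - n_human / n_total
    std_error = (fpc * statistics.variance(gaps) / n_human) ** 0.5
    return estimate, std_error


if __name__ == "__main__":
    # Synthetic data for illustration only: the "human" rating is the
    # LLM score plus a constant bias and some noise.
    demo_rng = random.Random(42)
    llm = [demo_rng.uniform(1, 5) for _ in range(1000)]
    truth = [s + 0.3 + demo_rng.gauss(0, 0.2) for s in llm]
    est, se = two_stage_estimate(llm, lambda i: truth[i], n_human=50)
    print(f"estimate={est:.3f}, std_error={se:.3f}")
```

Under simple random sampling, this estimator is unbiased for the mean human rating no matter how accurate the LLM is. The closer the LLM tracks the human raters, the smaller the variance of the gaps and the fewer human ratings are needed for a given precision, which is the sense in which the LLM plays an auxiliary rather than a substitutive role.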

Entities

Institutions

  • arXiv
