ARTFEED — Contemporary Art Intelligence

LLM-as-a-Judge Reliability Assessed via Item Response Theory

ai-technology · 2026-06-01

A new diagnostic framework applies Item Response Theory (IRT) to evaluate how reliable LLMs are as judges in automated evaluation. The two-phase framework, built on the Graded Response Model (GRM), measures intrinsic consistency (the stability of a judge's scores under prompt variations) and human alignment (how closely those scores correspond to human quality assessments). Empirical tests on diverse LLM judges show that IRT-GRM yields interpretable signals for systematically diagnosing judgments, offering practical guidance for verifying reliability. The study is published on arXiv with ID 2602.00521.
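For readers unfamiliar with the Graded Response Model, the sketch below shows the standard textbook form of the GRM for one ordinal item: each cumulative probability P(Y >= k) is a logistic function of the latent trait, and category probabilities are differences of adjacent cumulative probabilities. The function name, parameter names, and example values are illustrative assumptions; the summary above does not describe how the study parameterizes or fits the model.

```python
import math


def grm_category_probs(theta, discrimination, thresholds):
    """Return [P(Y = 0), ..., P(Y = K)] for one item under the standard GRM.

    theta          -- latent trait value (hypothetical example input)
    discrimination -- item slope parameter, usually written a
    thresholds     -- increasing category boundaries b_1 < ... < b_K
    """
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be in increasing order")

    # Cumulative probabilities P(Y >= k), with P(Y >= 0) = 1 and P(Y >= K + 1) = 0.
    cumulative = [1.0]
    for b in thresholds:
        cumulative.append(1.0 / (1.0 + math.exp(-discrimination * (theta - b))))
    cumulative.append(0.0)

    # Each category probability is the gap between adjacent cumulative curves.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]


# Example: a four-point rating scale with three thresholds (made-up values).
probs = grm_category_probs(theta=0.0, discrimination=1.5, thresholds=[-1.0, 0.0, 1.0])
```

A larger discrimination value makes the category curves steeper, so the item separates nearby trait levels more sharply.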

Key facts

  • Framework uses Item Response Theory (IRT) to assess LLM-as-a-Judge reliability.
  • Two-phase diagnostic framework: intrinsic consistency and human alignment.
  • Based on Graded Response Model (GRM) of IRT.
  • Intrinsic consistency measures stability under prompt variations.
  • Human alignment captures correspondence with human quality assessments.
  • Empirical examination of diverse LLM judges.
  • IRT-GRM yields interpretable signals for diagnosing judgments.
  • Published on arXiv with ID 2602.00521.
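To make the two phases concrete, here is a deliberately simple raw-score proxy for each one. It is not the IRT-GRM procedure from the study, which this summary does not detail: the spread of a judge's scores across prompt variants stands in for intrinsic consistency, and the mean absolute gap to human scores stands in for human alignment. The function name, the data layout, and the example values are all hypothetical.

```python
from statistics import mean, pstdev


def consistency_and_alignment(judge_scores, human_scores):
    """Return (spread, gap) as crude proxies for the two diagnostic phases.

    judge_scores -- {item_id: [judge score under each prompt variant]}
    human_scores -- {item_id: human quality score}
    Lower spread suggests more stable judgments; lower gap suggests closer
    agreement with human ratings.
    """
    # Phase 1 proxy: average per-item standard deviation across prompt variants.
    spread = mean(pstdev(scores) for scores in judge_scores.values())

    # Phase 2 proxy: average absolute difference between the judge's mean
    # score for an item and the human score for the same item.
    gap = mean(
        abs(mean(scores) - human_scores[item])
        for item, scores in judge_scores.items()
    )
    return spread, gap


# Made-up example: item "a" is judged consistently, item "b" is not.
judge = {"a": [4, 4, 4], "b": [2, 4]}
human = {"a": 4, "b": 2}
spread, gap = consistency_and_alignment(judge, human)
```

Raw-score summaries like these cannot separate item difficulty from judge behavior, which is the kind of distinction an IRT model is designed to make.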

Entities

Institutions

  • arXiv

Sources