LLM-as-a-Judge Reliability Assessed via Item Response Theory

ai-technology · 2026-06-01

A new diagnostic framework applies Item Response Theory (IRT) to evaluate how reliable LLMs are as judges in automated evaluation. Built on the Graded Response Model (GRM), the framework has two phases: one measures intrinsic consistency, meaning how stable a judge's behavior remains under prompt variations, and the other measures human alignment, meaning how closely the judge's ratings correspond to human quality assessments. Empirical tests on diverse LLM judges show that IRT-GRM yields interpretable signals for systematically diagnosing judgments and offers practical guidance for verifying reliability. The study is published on arXiv with ID 2602.00521.

Key facts

Framework uses Item Response Theory (IRT) to assess LLM-as-a-Judge reliability.
Two-phase diagnostic framework: intrinsic consistency and human alignment.
Based on Graded Response Model (GRM) of IRT.
Intrinsic consistency measures stability under prompt variations.
Human alignment captures correspondence with human quality assessments.
Empirical examination of diverse LLM judges.
IRT-GRM yields interpretable signals for diagnosing judgments.
Published on arXiv with ID 2602.00521.
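
For readers unfamiliar with the Graded Response Model, the sketch below shows the standard textbook form (Samejima's GRM) for an ordinal rating scale: each score threshold gets a logistic curve, and the probability of a specific score is the difference between adjacent cumulative curves. This is a minimal illustration, not the study's implementation. The function name, the parameter names, and the example values are hypothetical, and how the study maps judges, prompts, and samples onto these parameters is not described in this summary.

```python
import math


def grm_category_probs(theta, discrimination, thresholds):
    """Score probabilities under the standard graded response model.

    theta          -- latent trait value (illustrative; its meaning depends on the setup)
    discrimination -- slope parameter, must be positive
    thresholds     -- strictly increasing boundaries between adjacent scores
    Returns one probability per score; len(thresholds) + 1 values in total.
    """
    if discrimination <= 0:
        raise ValueError("discrimination must be positive")
    if any(hi <= lo for lo, hi in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be strictly increasing")

    # Cumulative curves P(score >= k): 1 for the lowest score, then one
    # logistic curve per threshold, then 0 past the highest score.
    cumulative = [1.0]
    for b in thresholds:
        cumulative.append(1.0 / (1.0 + math.exp(-discrimination * (theta - b))))
    cumulative.append(0.0)

    # The probability of each score is the gap between adjacent cumulative curves.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]


# Hypothetical 5-point rating scale with made-up parameter values.
probs = grm_category_probs(theta=0.0, discrimination=1.0,
                           thresholds=[-1.5, -0.5, 0.5, 1.5])
print([round(p, 3) for p in probs])
```

A larger discrimination value makes the score distribution respond more sharply to changes in the latent trait, which is one reason fitted GRM parameters are generally considered interpretable.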
