LLM-as-a-Judge Reliability Assessed via Item Response Theory
A new diagnostic framework applies Item Response Theory (IRT) to evaluate how reliable LLMs are when used as judges in automated evaluation. Built on the Graded Response Model (GRM), the framework has two phases: one measures intrinsic consistency, meaning the stability of a judge's assessments under prompt variations, and the other measures human alignment, meaning how closely the judge's ratings correspond to human quality assessments. Empirical tests on diverse LLM judges show that IRT-GRM yields interpretable signals for systematically diagnosing judgments and offers practical guidance for verifying reliability. The study is available on arXiv under ID 2602.00521.
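For readers unfamiliar with the GRM, the sketch below shows how the standard (Samejima) graded response model turns a latent trait value into probabilities over ordered rating categories. It is only an illustration of the textbook model: this summary does not give the study's exact parametrization, so the function name `grm_category_probs`, the parameter names, and the example values are all assumptions, as is any mapping of "trait" and "item" onto judges, prompts, or responses.

```python
import math

def grm_category_probs(theta, discrimination, thresholds):
    """Category probabilities under the textbook graded response model.

    theta: latent trait value (hypothetical; what it represents in the
        study is not specified in this summary).
    discrimination: slope parameter, often written a.
    thresholds: strictly increasing boundaries b_1 < ... < b_{K-1}.
    Returns K probabilities, one per ordered category 0..K-1.
    """
    # Cumulative curves: P(score >= k) for k = 1..K-1, each logistic in theta.
    cumulative = [
        1.0 / (1.0 + math.exp(-discrimination * (theta - b)))
        for b in thresholds
    ]
    # Pad with P(score >= 0) = 1 and P(score >= K) = 0.
    bounds = [1.0] + cumulative + [0.0]
    # P(score == k) is the difference of adjacent cumulative curves.
    return [bounds[k] - bounds[k + 1] for k in range(len(bounds) - 1)]

# Illustrative values only: a five-point rating scale needs four thresholds.
probs = grm_category_probs(theta=0.5, discrimination=1.5,
                           thresholds=[-1.0, 0.0, 1.0, 2.0])
```

In this formulation a larger discrimination makes the category probabilities more sensitive to changes in the latent trait, and the thresholds mark where adjacent rating categories trade off. Parameters of this kind are what make IRT-based diagnostics interpretable in general.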
Key facts
- Framework uses Item Response Theory (IRT) to assess LLM-as-a-Judge reliability.
- Two-phase diagnostic framework: intrinsic consistency and human alignment.
- Based on the Graded Response Model (GRM) from IRT.
- Intrinsic consistency measures stability under prompt variations.
- Human alignment captures correspondence with human quality assessments.
- Empirical examination of diverse LLM judges.
- IRT-GRM yields interpretable signals for diagnosing judgments.
- Published on arXiv with ID 2602.00521.
Entities
Institutions
- arXiv