New Framework for Evaluating LLMs Using Item Response Theory
A team of researchers has introduced a scalable and interpretable framework for assessing large language models (LLMs), grounded in Item Response Theory (IRT). The framework recasts evaluation as a series of constrained matrix factorization subproblems, enabling stable and efficient parameter estimation with theoretical guarantees of identifiability and convergence. Experiments on synthetic and real-world datasets, including MATH-500 and six benchmarks from the Open LLM Leaderboard, demonstrate the method's effectiveness. The approach addresses a key shortcoming of conventional benchmarking, which relies on average accuracy and neglects the stochasticity and variability of model responses.
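To make the IRT framing concrete, the sketch below fits a standard two-parameter logistic (2PL) IRT model to a binary model-by-item response matrix using alternating gradient updates with a positivity constraint on item discriminations. This is an illustrative assumption about how such a constrained, factorization-style fit might look, not the paper's algorithm; the function names and the toy data are hypothetical.

```python
# Minimal sketch: fitting a 2PL IRT model to a binary response matrix
# with alternating gradient updates, loosely in the spirit of a constrained
# matrix-factorization formulation. Illustrative only; not the paper's method.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_2pl(Y, n_iters=500, lr=0.05, seed=0):
    """Y: (n_models, n_items) binary matrix, 1 = item answered correctly."""
    rng = np.random.default_rng(seed)
    n_models, n_items = Y.shape
    theta = rng.normal(0, 0.1, n_models)   # latent model abilities
    a = np.ones(n_items)                   # item discriminations (kept > 0)
    b = rng.normal(0, 0.1, n_items)        # item difficulties

    for _ in range(n_iters):
        # Predicted success probability for every (model, item) pair
        logits = a[None, :] * (theta[:, None] - b[None, :])
        P = sigmoid(logits)
        R = Y - P                          # Bernoulli log-likelihood residuals

        # Gradient ascent on the log-likelihood, alternating parameter blocks
        theta += lr * (R * a[None, :]).sum(axis=1) / n_items
        b     -= lr * (R * a[None, :]).sum(axis=0) / n_models
        a     += lr * (R * (theta[:, None] - b[None, :])).sum(axis=0) / n_models
        a      = np.clip(a, 1e-3, None)    # positivity constraint for identifiability
        theta -= theta.mean()              # fix the location of the latent scale

    return theta, a, b

if __name__ == "__main__":
    # Toy example: 20 "models" answering 50 "benchmark items"
    rng = np.random.default_rng(1)
    true_theta = rng.normal(0, 1, 20)
    true_b = rng.normal(0, 1, 50)
    true_a = rng.uniform(0.5, 2.0, 50)
    P = sigmoid(true_a * (true_theta[:, None] - true_b))
    Y = rng.binomial(1, P)
    theta, a, b = fit_2pl(Y)
    print("correlation with true abilities:",
          np.corrcoef(theta, true_theta)[0, 1].round(3))
```

The recovered ability parameters give a ranking of models that accounts for item difficulty and discrimination rather than treating every benchmark question as equally informative, which is the basic advantage of IRT-style evaluation over plain average accuracy.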
Key facts
- Proposes an interpretable and scalable framework for LLM evaluation based on IRT
- Reformulates evaluation as constrained matrix factorization subproblems
- Provides theoretical guarantees for identifiability and convergence
- Tested on synthetic and real-world datasets including MATH-500 and six Open LLM Leaderboard benchmarks
- Addresses limitations of average accuracy metrics
Entities
Institutions
- arXiv
- Open LLM Leaderboard