New Framework for Evaluating LLMs Using Item Response Theory
A team of researchers has introduced a scalable and interpretable framework for assessing large language models (LLMs), grounded in Item Response Theory (IRT). The framework recasts evaluation as a series of constrained matrix factorization subproblems, enabling stable and efficient parameter estimation with theoretical guarantees of identifiability and convergence. Experiments on synthetic and real-world datasets, including MATH-500 and six benchmarks from the Open LLM Leaderboard, demonstrate the method's effectiveness. The approach addresses a key shortcoming of conventional benchmarking, which relies on average accuracy and neglects the stochasticity and variability of model responses.
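To make the IRT framing concrete, the sketch below fits a standard two-parameter logistic (2PL) IRT model to a binary model-by-item response matrix using alternating gradient updates with a positivity constraint on item discriminations. This is an illustrative assumption about how such a constrained, factorization-style fit might look, not the paper's algorithm; the function names and the toy data are hypothetical.

```python
# Minimal sketch: fitting a 2PL IRT model to a binary response matrix
# with alternating gradient updates, loosely in the spirit of a constrained
# matrix-factorization formulation. Illustrative only; not the paper's method.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_2pl(Y, n_iters=500, lr=0.05, seed=0):
    """Y: (n_models, n_items) binary matrix, 1 = item answered correctly."""
    rng = np.random.default_rng(seed)
    n_models, n_items = Y.shape
    theta = rng.normal(0, 0.1, n_models)   # latent model abilities
    a = np.ones(n_items)                   # item discriminations (kept > 0)
    b = rng.normal(0, 0.1, n_items)        # item difficulties

    for _ in range(n_iters):
        # Predicted success probability for every (model, item) pair
        logits = a[None, :] * (theta[:, None] - b[None, :])
        P = sigmoid(logits)
        R = Y - P                          # Bernoulli log-likelihood residuals

        # Gradient ascent on the log-likelihood, alternating parameter blocks
        theta += lr * (R * a[None, :]).sum(axis=1) / n_items
        b     -= lr * (R * a[None, :]).sum(axis=0) / n_models
        a     += lr * (R * (theta[:, None] - b[None, :])).sum(axis=0) / n_models
        a      = np.clip(a, 1e-3, None)    # positivity constraint for identifiability
        theta -= theta.mean()              # fix the location of the latent scale

    return theta, a, b

if __name__ == "__main__":
    # Toy example: 20 "models" answering 50 "benchmark items"
    rng = np.random.default_rng(1)
    true_theta = rng.normal(0, 1, 20)
    true_b = rng.normal(0, 1, 50)
    true_a = rng.uniform(0.5, 2.0, 50)
    P = sigmoid(true_a * (true_theta[:, None] - true_b))
    Y = rng.binomial(1, P)
    theta, a, b = fit_2pl(Y)
    print("correlation with true abilities:",
          np.corrcoef(theta, true_theta)[0, 1].round(3))
```

The recovered ability parameters give a ranking of models that accounts for item difficulty and discrimination rather than treating every benchmark question as equally informative, which is the basic advantage of IRT-style evaluation over plain average accuracy.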
Key facts
- Proposes an interpretable and scalable framework for LLM evaluation based on IRT
- Reformulates evaluation as constrained matrix factorization subproblems
- Provides theoretical guarantees for identifiability and convergence
- Tested on synthetic and real-world datasets including MATH-500 and six Open LLM Leaderboard benchmarks
- Addresses limitations of average accuracy metrics
Entities
Institutions
- arXiv
- Open LLM Leaderboard