Reliable Change Index Adapted for LLM Version Comparison
Researchers adapted the Reliable Change Index (RCI) from clinical psychology to compare language model versions at the item level, using 2,000 MMLU-Pro items with K=10 samples per item at temperature 0.7. Two within-family upgrades were examined: Llama 3 to 3.1, which gained 1.6 points in aggregate, and Qwen 2.5 to 3, which gained 2.8 points. Most items showed no reliable change (79% for Llama, 72% for Qwen), and over half sat at floor or ceiling performance, which placed them outside the RCI's analysable range. Among the analysable items the picture was mixed: 34% of Llama items improved while 28% deteriorated, and for Qwen 47% improved against 39% that declined. Regressions clustered by domain, with Llama losing ground in physics and Qwen in law. By contrast, a greedy single-shot evaluation missed 42% of the reliably changed items and falsely flagged 25% of the unchanged ones.
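The per-item logic can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: it assumes each version's item accuracy is estimated from K=10 sampled answers, uses the standard error of a binomial proportion, and applies the conventional |RCI| > 1.96 significance cutoff. Function names and the threshold default are hypothetical.

```python
import math

def item_rci(correct_v1, correct_v2, k=10):
    """Reliable Change Index for one item across two model versions.

    correct_v1 / correct_v2: correct answers out of k sampled generations
    (e.g. K=10 at T=0.7). Returns None for floor/ceiling items, where
    both versions score 0% or 100% and the standard error collapses.
    """
    p1, p2 = correct_v1 / k, correct_v2 / k
    se1 = math.sqrt(p1 * (1 - p1) / k)  # binomial SE of each estimate
    se2 = math.sqrt(p2 * (1 - p2) / k)
    se_diff = math.sqrt(se1**2 + se2**2)  # SE of the difference
    if se_diff == 0:
        return None  # not analysable; over half the items here were like this
    return (p2 - p1) / se_diff

def classify(rci, threshold=1.96):
    """Map an RCI value to a change category (|RCI| > 1.96 ~ p < .05)."""
    if rci is None:
        return "floor/ceiling"
    if rci > threshold:
        return "improved"
    if rci < -threshold:
        return "deteriorated"
    return "no reliable change"
```

For example, an item going from 2/10 to 8/10 correct yields RCI ≈ 3.35 and is classified as improved, while 5/10 to 6/10 stays within noise.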
Key facts
- RCI adapted from clinical psychology to LLM version comparison
- 2,000 MMLU-Pro items used with K=10 samples at T=0.7
- Llama 3 to 3.1: +1.6 points aggregate gain
- Qwen 2.5 to 3: +2.8 points aggregate gain
- 79% of items showed no reliable change for Llama, 72% for Qwen
- Over half of items were floor/ceiling
- Among analysable items: 34% improved, 28% deteriorated for Llama
- Among analysable items: 47% improved, 39% deteriorated for Qwen
- Median |Δp| = 0.50 for Llama, 0.90 for Qwen
- Low-accuracy items improved, high-accuracy items deteriorated
- Llama lost physics, Qwen lost law
- Greedy single-shot missed 42% of reliably changed items
- Greedy single-shot falsely flagged 25% of unchanged items
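The last two facts can be illustrated with a small simulation. As a stand-in for a single-shot comparison, this sketch (assumptions mine, not the paper's setup) draws one stochastic answer per version and flags "change" whenever the two answers disagree; for an item whose true accuracy is identical across versions, such disagreement is pure noise.

```python
import random

random.seed(0)

def single_shot_change(p1, p2):
    """Flag a change if one sample per version disagrees on correctness.

    A stand-in for comparing one answer per model version; p1 and p2 are
    each version's true per-item accuracy.
    """
    return (random.random() < p1) != (random.random() < p2)

# An item with no real change (both versions at 50% accuracy) is still
# flagged roughly half the time by a single-sample comparison.
trials = 10_000
false_flag_rate = sum(single_shot_change(0.5, 0.5) for _ in range(trials)) / trials
print(false_flag_rate)
```

This is why a single greedy run both misses real changes and flags spurious ones: one sample per version cannot separate a shift in accuracy from sampling noise, whereas K repeated samples per item give the RCI the standard error it needs.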
Entities
Institutions
- arXiv