Reliable Change Index Adapted for LLM Version Comparison
Researchers adapted the Reliable Change Index (RCI) from clinical psychology to compare language model versions at the item level, using 2,000 MMLU-Pro items with K=10 samples per item at temperature 0.7. Two within-family upgrades were examined: Llama 3 to 3.1, which gained 1.6 points in aggregate, and Qwen 2.5 to 3, which gained 2.8 points. Most items showed no reliable change (79% for Llama, 72% for Qwen), and over half sat at floor or ceiling performance, which placed them outside the RCI's analysable range. Among the analysable items the picture was mixed: 34% of Llama items improved while 28% deteriorated, and for Qwen 47% improved against 39% that declined. Regressions clustered by domain, with Llama losing ground in physics and Qwen in law. By contrast, a greedy single-shot evaluation missed 42% of the reliably changed items and falsely flagged 25% of the unchanged ones.
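The per-item logic can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: it assumes each version's item accuracy is estimated from K=10 sampled answers, uses the standard error of a binomial proportion, and applies the conventional |RCI| > 1.96 significance cutoff. Function names and the threshold default are hypothetical.

```python
import math

def item_rci(correct_v1, correct_v2, k=10):
    """Reliable Change Index for one item across two model versions.

    correct_v1 / correct_v2: correct answers out of k sampled generations
    (e.g. K=10 at T=0.7). Returns None for floor/ceiling items, where
    both versions score 0% or 100% and the standard error collapses.
    """
    p1, p2 = correct_v1 / k, correct_v2 / k
    se1 = math.sqrt(p1 * (1 - p1) / k)  # binomial SE of each estimate
    se2 = math.sqrt(p2 * (1 - p2) / k)
    se_diff = math.sqrt(se1**2 + se2**2)  # SE of the difference
    if se_diff == 0:
        return None  # not analysable; over half the items here were like this
    return (p2 - p1) / se_diff

def classify(rci, threshold=1.96):
    """Map an RCI value to a change category (|RCI| > 1.96 ~ p < .05)."""
    if rci is None:
        return "floor/ceiling"
    if rci > threshold:
        return "improved"
    if rci < -threshold:
        return "deteriorated"
    return "no reliable change"
```

For example, an item going from 2/10 to 8/10 correct yields RCI ≈ 3.35 and is classified as improved, while 5/10 to 6/10 stays within noise.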
Key facts
- RCI adapted from clinical psychology to LLM version comparison
- 2,000 MMLU-Pro items used with K=10 samples at T=0.7
- Llama 3 to 3.1: +1.6 points aggregate gain
- Qwen 2.5 to 3: +2.8 points aggregate gain
- 79% of items showed no reliable change for Llama, 72% for Qwen
- Over half of items were floor/ceiling
- Among analysable items: 34% improved, 28% deteriorated for Llama
- Among analysable items: 47% improved, 39% deteriorated for Qwen
- Median |Δp| = 0.50 for Llama, 0.90 for Qwen
- Low-accuracy items improved, high-accuracy items deteriorated
- Llama lost physics, Qwen lost law
- Greedy single-shot missed 42% of reliably changed items
- Greedy single-shot falsely flagged 25% of unchanged items
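The last two facts can be illustrated with a small simulation. As a stand-in for a single-shot comparison, this sketch (assumptions mine, not the paper's setup) draws one stochastic answer per version and flags "change" whenever the two answers disagree; for an item whose true accuracy is identical across versions, such disagreement is pure noise.

```python
import random

random.seed(0)

def single_shot_change(p1, p2):
    """Flag a change if one sample per version disagrees on correctness.

    A stand-in for comparing one answer per model version; p1 and p2 are
    each version's true per-item accuracy.
    """
    return (random.random() < p1) != (random.random() < p2)

# An item with no real change (both versions at 50% accuracy) is still
# flagged roughly half the time by a single-sample comparison.
trials = 10_000
false_flag_rate = sum(single_shot_change(0.5, 0.5) for _ in range(trials)) / trials
print(false_flag_rate)
```

This is why a single greedy run both misses real changes and flags spurious ones: one sample per version cannot separate a shift in accuracy from sampling noise, whereas K repeated samples per item give the RCI the standard error it needs.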
Entities
Institutions
- arXiv