ARTFEED — Contemporary Art Intelligence

Reliable Change Index Adapted for LLM Version Comparison

ai-technology · 2026-05-01

Researchers adapted the Reliable Change Index (RCI) from clinical psychology to compare language-model versions at the level of individual benchmark items, using 2,000 MMLU-Pro items with K=10 samples per item at temperature 0.7. They examined two pairs within the same model family: Llama 3 to 3.1 gained 1.6 points in aggregate, and Qwen 2.5 to 3 gained 2.8. At the item level, however, most items showed no reliable change (79% for Llama, 72% for Qwen), and over half sat at floor or ceiling performance, leaving no room to detect movement. Among the analysable items, change ran in both directions: for Llama, 34% improved while 28% deteriorated; for Qwen, 47% improved against 39% that deteriorated. The shifts also had subject-level structure, with Llama losing ground in physics and Qwen in law. By contrast, a greedy single-shot evaluation missed 42% of reliably changed items and falsely flagged 25% of unchanged ones.
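The article does not spell out the exact RCI formulation the researchers used, but the per-item logic can be sketched with a standard two-proportion z-statistic on the K-sample pass rates; the function name, threshold, and floor/ceiling handling below are assumptions for illustration, not the paper's method.

```python
import math

def classify_item(c1: int, c2: int, K: int = 10, z_crit: float = 1.96) -> str:
    """Classify one benchmark item across two model versions.

    c1, c2: correct answers out of K samples (temperature 0.7) for the
    old and new version. The two-proportion z-test is an assumed
    stand-in for the paper's exact RCI adaptation.
    """
    p1, p2 = c1 / K, c2 / K
    if p1 == p2 and p1 in (0.0, 1.0):
        # Both versions at 0% or 100%: floor/ceiling, not analysable.
        return "floor/ceiling"
    se = math.sqrt(p1 * (1 - p1) / K + p2 * (1 - p2) / K)
    if se == 0.0:
        # Degenerate case (e.g. 0/10 -> 10/10): an unambiguous change.
        return "improved" if p2 > p1 else "deteriorated"
    z = (p2 - p1) / se
    if z > z_crit:
        return "improved"
    if z < -z_crit:
        return "deteriorated"
    return "no reliable change"

print(classify_item(2, 9))   # large jump in pass rate -> "improved"
print(classify_item(5, 6))   # small shift -> "no reliable change"
print(classify_item(10, 10)) # saturated item -> "floor/ceiling"
```

Running all items through such a classifier and tallying the labels yields the kind of per-item breakdown reported above, rather than a single aggregate score.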

Key facts

  • RCI adapted from clinical psychology to LLM version comparison
  • 2,000 MMLU-Pro items used with K=10 samples at T=0.7
  • Llama 3 to 3.1: +1.6 points aggregate gain
  • Qwen 2.5 to 3: +2.8 points aggregate gain
  • 79% of items showed no reliable change for Llama, 72% for Qwen
  • Over half of items were at floor/ceiling performance, so not analysable
  • Among analysable items: 34% improved, 28% deteriorated for Llama
  • Among analysable items: 47% improved, 39% deteriorated for Qwen
  • Median |delta p| = 0.50 for Llama, 0.90 for Qwen
  • Low-accuracy items tended to improve; high-accuracy items tended to deteriorate
  • Llama lost ground in physics, Qwen in law
  • Greedy single-shot missed 42% of reliably changed items
  • Greedy single-shot falsely flagged 25% of unchanged items
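The last two facts follow from how little information one sample per version carries. The article does not detail the single-shot protocol, but as a toy illustration under assumed mechanics: an item whose pass rate reliably moves from 0.2 to 0.8 still produces the same correctness outcome in a single draw from each version about a third of the time, so a one-shot comparison would miss the change.

```python
# Toy illustration (assumed mechanics, not the paper's exact setup):
# probability that one sample from each version agrees in correctness,
# i.e. both correct or both incorrect, for pass rates 0.2 -> 0.8.
p1, p2 = 0.2, 0.8
p_same_outcome = p1 * p2 + (1 - p1) * (1 - p2)
print(round(p_same_outcome, 2))  # 0.32
```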

Entities

Institutions

  • arXiv

Sources