ARTFEED — Contemporary Art Intelligence

Statistical Framework Detects LLM Degradation

ai-technology · 2026-05-07

A new statistical framework uses McNemar's test to detect degradation in large language models (LLMs) after optimizations that are theoretically lossless. The method compares a model's outputs on paired samples before and after an optimization, flagging statistically significant accuracy drops while controlling the false positive rate. This addresses the problem that, even at temperature zero, where decoding is nominally deterministic, floating-point numerical errors can make generations non-robust. The framework is proposed as a quality gate for efforts to reduce inference cost and latency.
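The core idea can be sketched with a minimal, stdlib-only example. The function below is a hypothetical illustration (not the paper's code): it takes per-sample correctness labels from the baseline and optimized models on the same prompts, counts discordant pairs, and computes a one-sided exact McNemar p-value for degradation.

```python
from math import comb

def mcnemar_exact(baseline: list, optimized: list) -> float:
    """One-sided exact McNemar test for degradation.

    baseline, optimized: per-sample correctness labels (1 = correct, 0 = wrong)
    on the SAME prompts. Returns the p-value under H0: the optimized model
    is no worse than the baseline.
    """
    # Discordant pairs: b = baseline correct but optimized wrong (degradation),
    # c = baseline wrong but optimized correct (improvement).
    b = sum(1 for x, y in zip(baseline, optimized) if x and not y)
    c = sum(1 for x, y in zip(baseline, optimized) if not x and y)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of degradation
    # Under H0 each discordant pair flips either way with probability 1/2,
    # so the p-value is P(B >= b) for B ~ Binomial(n, 0.5).
    return sum(comb(n, k) for k in range(b, n + 1)) / 2**n
```

Concordant pairs (both correct or both wrong) carry no information under McNemar's test, so only the discordant counts matter; rejecting H0 at a chosen significance level is what bounds the false positive rate.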

Key facts

  • arXiv:2602.10144v2
  • McNemar's test is used for hypothesis testing
  • Detects model degradation from numerical errors
  • Controls false positive rate
  • Applicable to theoretically lossless optimizations
  • Addresses non-robust generations at temperature zero
