ARTFEED — Contemporary Art Intelligence

Statistical Framework Detects LLM Degradation

ai-technology · 2026-05-07

A new statistical framework uses McNemar's test to detect degradation in large language models (LLMs) after optimizations that are theoretically lossless. The method compares a model's outputs on paired samples before and after an optimization, flagging statistically significant accuracy drops while controlling the false positive rate. This addresses the problem that, even at temperature zero, where decoding is nominally deterministic, floating-point numerical errors can make generations non-robust. The framework is proposed as a quality gate for efforts to reduce inference cost and latency.
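The core idea can be sketched with a minimal, stdlib-only example. The function below is a hypothetical illustration (not the paper's code): it takes per-sample correctness labels from the baseline and optimized models on the same prompts, counts discordant pairs, and computes a one-sided exact McNemar p-value for degradation.

```python
from math import comb

def mcnemar_exact(baseline: list, optimized: list) -> float:
    """One-sided exact McNemar test for degradation.

    baseline, optimized: per-sample correctness labels (1 = correct, 0 = wrong)
    on the SAME prompts. Returns the p-value under H0: the optimized model
    is no worse than the baseline.
    """
    # Discordant pairs: b = baseline correct but optimized wrong (degradation),
    # c = baseline wrong but optimized correct (improvement).
    b = sum(1 for x, y in zip(baseline, optimized) if x and not y)
    c = sum(1 for x, y in zip(baseline, optimized) if not x and y)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of degradation
    # Under H0 each discordant pair flips either way with probability 1/2,
    # so the p-value is P(B >= b) for B ~ Binomial(n, 0.5).
    return sum(comb(n, k) for k in range(b, n + 1)) / 2**n
```

Concordant pairs (both correct or both wrong) carry no information under McNemar's test, so only the discordant counts matter; rejecting H0 at a chosen significance level is what bounds the false positive rate.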

Key facts

  • arXiv:2602.10144v2
  • McNemar's test is used for hypothesis testing
  • Detects model degradation from numerical errors
  • Controls false positive rate
  • Applicable to theoretically lossless optimizations
  • Addresses non-robust generations at temperature zero
