LLM Benchmark Contamination Signals Are Fragile
A new arXiv preprint (2509.00072v4) challenges the assumption that post-cutoff performance decay in large language models reliably indicates benchmark contamination. The researchers show that the temporal signal is highly sensitive to how questions are constructed: LLM-transformed questions produce different temporal patterns than cloze (fill-in-the-blank) questions drawn from the same documents. They validate this on LiveCodeBench, where a simple LLM-driven transformation of the same problems removed the temporal pattern entirely. The work offers a mechanistic account of the effect, calling into question the robustness of contamination detection based on temporal signals.
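A minimal sketch of how such a temporal signal is typically measured: problems are split by release date relative to the model's training cutoff, and accuracy is compared across the two buckets. The cutoff date and result records below are invented for illustration and are not taken from the paper.

```python
from datetime import date
from statistics import mean

# Hypothetical training cutoff for some model (illustrative only).
CUTOFF = date(2023, 9, 1)

# Hypothetical (release_date, solved) records for one model on a benchmark.
results = [
    (date(2023, 5, 1), True),
    (date(2023, 6, 15), True),
    (date(2023, 8, 20), False),
    (date(2023, 10, 3), False),
    (date(2023, 11, 12), False),
    (date(2024, 1, 7), True),
]

def accuracy(rows):
    """Fraction of problems solved in a list of (date, solved) records."""
    return mean(1.0 if solved else 0.0 for _, solved in rows)

# Bucket problems by whether they appeared before or after the cutoff.
pre = [r for r in results if r[0] <= CUTOFF]
post = [r for r in results if r[0] > CUTOFF]

# A large positive decay is the "temporal signal" read as contamination.
decay = accuracy(pre) - accuracy(post)
print(f"pre-cutoff: {accuracy(pre):.2f}, post-cutoff: {accuracy(post):.2f}, decay: {decay:.2f}")
```

The paper's point is that this decay statistic can appear or vanish depending on how the questions themselves were constructed, independent of contamination.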
Key facts
- Post-cutoff performance decay has been interpreted as a temporal signal for benchmark contamination.
- The study demonstrates that this temporal signal is highly sensitive to how benchmark questions are constructed.
- LLM-transformed questions can produce remarkably different temporal patterns compared to fill-in-the-blank (cloze) questions from the same documents.
- The effect was validated on LiveCodeBench, which previously showed clear post-cutoff decay.
- A simple LLM-driven transformation of the same problems can effectively remove the temporal pattern.
- The study provides a mechanistic understanding of this phenomenon.
- The paper is available on arXiv with ID 2509.00072.
- The research critically examines the view that public information released before the training cutoff inflates model performance through memorization.
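The contrast between the two question-construction styles can be sketched as follows. A cloze question masks a span in the document's original wording, whereas an LLM transformation rewrites the prompt entirely; `make_cloze` is a hypothetical helper for illustration, not the study's code.

```python
import re

def make_cloze(sentence: str, answer: str, blank: str = "____") -> tuple[str, str]:
    """Turn a source sentence into a fill-in-the-blank question by masking
    the answer span verbatim. Unlike an LLM rewrite, the cloze question
    preserves the document's original surface form, which is exactly what
    a memorizing model may have seen during training."""
    pattern = re.compile(re.escape(answer), flags=re.IGNORECASE)
    question = pattern.sub(blank, sentence, count=1)
    return question, answer

q, a = make_cloze(
    "The benchmark LiveCodeBench collects problems released after each model's cutoff.",
    "LiveCodeBench",
)
print(q)  # The benchmark ____ collects problems released after each model's cutoff.
```

Because the cloze form keeps the original text intact, any temporal pattern it elicits can differ sharply from the pattern produced by LLM-paraphrased questions over the same documents, which is the sensitivity the study documents.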
Entities
Benchmarks
- LiveCodeBench
Platforms
- arXiv