LLM Benchmark Contamination Signals Are Fragile
A new arXiv preprint (2509.00072v4) challenges the assumption that post-cutoff performance decay in large language models reliably indicates benchmark contamination. The researchers show that the temporal signal is highly sensitive to how questions are constructed: LLM-transformed questions produce different temporal patterns than cloze (fill-in-the-blank) questions drawn from the same documents. They validate this on LiveCodeBench, where a simple LLM-driven transformation of the same problems removed the temporal pattern entirely. The work offers a mechanistic account of the effect, calling into question the robustness of contamination detection based on temporal signals.
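A minimal sketch of how such a temporal signal is typically measured: problems are split by release date relative to the model's training cutoff, and accuracy is compared across the two buckets. The cutoff date and result records below are invented for illustration and are not taken from the paper.

```python
from datetime import date
from statistics import mean

# Hypothetical training cutoff for some model (illustrative only).
CUTOFF = date(2023, 9, 1)

# Hypothetical (release_date, solved) records for one model on a benchmark.
results = [
    (date(2023, 5, 1), True),
    (date(2023, 6, 15), True),
    (date(2023, 8, 20), False),
    (date(2023, 10, 3), False),
    (date(2023, 11, 12), False),
    (date(2024, 1, 7), True),
]

def accuracy(rows):
    """Fraction of problems solved in a list of (date, solved) records."""
    return mean(1.0 if solved else 0.0 for _, solved in rows)

# Bucket problems by whether they appeared before or after the cutoff.
pre = [r for r in results if r[0] <= CUTOFF]
post = [r for r in results if r[0] > CUTOFF]

# A large positive decay is the "temporal signal" read as contamination.
decay = accuracy(pre) - accuracy(post)
print(f"pre-cutoff: {accuracy(pre):.2f}, post-cutoff: {accuracy(post):.2f}, decay: {decay:.2f}")
```

The paper's point is that this decay statistic can appear or vanish depending on how the questions themselves were constructed, independent of contamination.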
Key facts
- Post-cutoff performance decay has been interpreted as a temporal signal for benchmark contamination.
- The study demonstrates that this temporal signal is highly sensitive to how benchmark questions are constructed.
- LLM-transformed questions can produce remarkably different temporal patterns compared to fill-in-the-blank (cloze) questions from the same documents.
- The effect was validated on LiveCodeBench, which previously showed clear post-cutoff decay.
- A simple LLM-driven transformation of the same problems can effectively remove the temporal pattern.
- The study provides a mechanistic understanding of this phenomenon.
- The paper is available on arXiv with ID 2509.00072.
- The research critically examines the view that public information released before the training cutoff inflates model performance through memorization.
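The contrast between the two question-construction styles can be sketched as follows. A cloze question masks a span in the document's original wording, whereas an LLM transformation rewrites the prompt entirely; `make_cloze` is a hypothetical helper for illustration, not the study's code.

```python
import re

def make_cloze(sentence: str, answer: str, blank: str = "____") -> tuple[str, str]:
    """Turn a source sentence into a fill-in-the-blank question by masking
    the answer span verbatim. Unlike an LLM rewrite, the cloze question
    preserves the document's original surface form, which is exactly what
    a memorizing model may have seen during training."""
    pattern = re.compile(re.escape(answer), flags=re.IGNORECASE)
    question = pattern.sub(blank, sentence, count=1)
    return question, answer

q, a = make_cloze(
    "The benchmark LiveCodeBench collects problems released after each model's cutoff.",
    "LiveCodeBench",
)
print(q)  # The benchmark ____ collects problems released after each model's cutoff.
```

Because the cloze form keeps the original text intact, any temporal pattern it elicits can differ sharply from the pattern produced by LLM-paraphrased questions over the same documents, which is the sensitivity the study documents.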
Entities
Benchmarks
- LiveCodeBench
Platforms
- arXiv