ARTFEED — Contemporary Art Intelligence

LLM Benchmark Contamination Signals Are Fragile

ai-technology · 2026-05-14

A new study posted to arXiv (2509.00072v4) challenges the assumption that post-cutoff performance decay in large language models reliably indicates benchmark contamination. The researchers show that this temporal signal is highly sensitive to question construction: LLM-transformed questions produce different temporal patterns than cloze questions drawn from the same documents. They validate the effect on LiveCodeBench, where a simple LLM-driven transformation removed the temporal pattern entirely. The work offers a mechanistic understanding of the phenomenon and calls into question the robustness of contamination detection via temporal signals.

Key facts

  • Post-cutoff performance decay has been interpreted as a temporal signal for benchmark contamination.
  • The study demonstrates that this temporal signal is highly sensitive to how benchmark questions are constructed.
  • LLM-transformed questions can produce remarkably different temporal patterns compared to fill-in-the-blank (cloze) questions from the same documents.
  • The effect was validated on LiveCodeBench, which previously showed clear post-cutoff decay.
  • A simple LLM-driven transformation of the same problems can effectively remove the temporal pattern.
  • The study provides a mechanistic understanding of this phenomenon.
  • The paper is available on arXiv with ID 2509.00072.
  • The research critically examines the view that public information released before training cutoff inflates model performance by memorization.
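To make the "temporal signal" concrete: it is usually computed by comparing a model's accuracy on questions whose source material predates the training cutoff against questions from after it. The sketch below is illustrative only (the function name, data, and cutoff date are hypothetical, not from the paper); it shows the kind of pre/post accuracy gap that has been read as evidence of contamination.

```python
# Illustrative sketch, not the paper's code: a temporal contamination
# signal compares accuracy on pre-cutoff vs. post-cutoff questions.
from datetime import date

def temporal_decay(results, cutoff):
    """results: list of (question_date, correct: bool) pairs.
    Returns pre-cutoff accuracy minus post-cutoff accuracy."""
    pre = [ok for d, ok in results if d < cutoff]
    post = [ok for d, ok in results if d >= cutoff]
    acc = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    # A large positive gap is the decay pattern the study probes.
    return acc(pre) - acc(post)

# Hypothetical evaluation results: (source date, model got it right?)
results = [
    (date(2023, 1, 5), True), (date(2023, 2, 1), True),
    (date(2023, 3, 9), True), (date(2023, 4, 2), False),
    (date(2024, 1, 7), False), (date(2024, 2, 3), True),
    (date(2024, 3, 1), False), (date(2024, 4, 8), False),
]
gap = temporal_decay(results, cutoff=date(2023, 12, 31))
print(f"pre-minus-post accuracy gap: {gap:.2f}")  # 0.75 - 0.25 = 0.50
```

The study's point is that this gap can vanish when the same underlying problems are rephrased (e.g., by an LLM transformation), so the gap alone is not reliable evidence of memorization.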

Entities

Institutions

  • arXiv
  • LiveCodeBench

Sources

  • arXiv: 2509.00072v4