LLMs May Rote Memorize Benchmarks, Inflating Scores

ai-technology · 2026-05-18

A recent study posted on arXiv (2504.08300) questions how reliable benchmark-based evaluation is for Large Language Models (LLMs), whether the benchmark uses multiple-choice questions (MCQs) or open-ended questions (OEQs). The authors reframe benchmark contamination as an inherent part of the learning process and try to separate genuine capability learning from rote memorization. They report that LLMs perform worse on benchmarks they have memorized than on benchmarks they have not, which they take as evidence that rote learning and genuine comprehension coexist. The authors propose strategies for disentangling the two, and stress that score gains caused by contamination can make evaluation results misleading.

Key facts

Study reframes benchmark contamination as inherent to learning.
LLMs perform worse on memorized benchmarks than on non-memorized ones.
Rote memorization and genuine capability learning coexist.
Benchmark contamination inflates LLM evaluation scores.
Multiple-choice and open-ended questions are common evaluation methods.
Research aims to disentangle memorization from genuine learning.
Pre-exposure to test benchmarks during training causes inflated performance.
Findings challenge the reliability of current LLM evaluation benchmarks.
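
The memorized versus non-memorized comparison in the key facts can be pictured with a small sketch. It is an illustration only, not the study's method: the summary does not say how memorization is detected, so the sketch assumes each benchmark item already carries a memorized flag from some separate probe. The ItemResult record, its field names, the accuracy_by_memorization helper, and the toy data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ItemResult:
    """One benchmark question and the model's outcome (hypothetical record)."""
    question_id: str
    memorized: bool  # assumed output of a separate memorization probe
    correct: bool    # whether the model answered this question correctly


def accuracy_by_memorization(results):
    """Return accuracy computed separately for memorized and non-memorized items."""
    totals = {True: 0, False: 0}
    hits = {True: 0, False: 0}
    for r in results:
        totals[r.memorized] += 1
        hits[r.memorized] += int(r.correct)
    return {
        # None marks a group with no items, so an empty group is not read as 0%.
        "memorized": hits[True] / totals[True] if totals[True] else None,
        "non_memorized": hits[False] / totals[False] if totals[False] else None,
    }


# Made-up toy data, used only to exercise the function.
toy = [
    ItemResult("q1", True, False),
    ItemResult("q2", True, True),
    ItemResult("q3", False, True),
    ItemResult("q4", False, True),
]
print(accuracy_by_memorization(toy))
```

Reporting the two groups separately, rather than one pooled score, is what makes a gap between them visible; a pooled accuracy would hide it.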
