Study Reveals Positional Failures in Long-Context LLM Reasoning Benchmarks

ai-technology · 2026-05-25

A study posted on arXiv (2605.23170) argues that existing long-context reasoning benchmarks do not adequately control where the target task sits within a long sequence. The researchers audited 11 long-context benchmarks and found that none jointly controls task position, filler content, and context length for reasoning tasks. They also reviewed four flagship long-context model releases and found no entries for Needle-in-a-Haystack (NIAH), RULER, or LongBench-family benchmarks in the main result tables, while agentic and coding benchmarks appeared in all four. To address the gap, the authors introduce Context Rot Evaluation (CRE), a controlled framework that varies all three variables. They evaluated nine LLMs on GSM8K and ARC-Challenge in two rounds: an initial set of five models, followed by four newer vendor releases. The results show that some models lose substantial accuracy when the target task moves from the end of the context to the middle, and that the drop grows with context length for vulnerable models. MiMo-v2-Flash, for example, dropped 88 percentage points at a 64K context length under specific conditions. The study points to a significant gap in how long-context LLMs are currently evaluated.
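To make the three controlled variables concrete, the following is a minimal sketch of how a prompt could be assembled with an explicit task position, filler source, and length budget. It is not the paper's implementation: the function name `build_prompt`, the use of a word count as a stand-in for a token budget, and the fractional `position` parameter are all assumptions made for illustration.

```python
from itertools import cycle


def build_prompt(task: str, filler: list[str], position: float, budget_words: int) -> str:
    """Place `task` at a relative position inside repeated filler text.

    position: 0.0 puts the task at the start, 1.0 at the end (hypothetical convention).
    budget_words: rough length budget; words are an assumed proxy for tokens.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be in [0, 1]")
    # Drop empty sentences so the loop below always makes progress.
    filler = [s for s in filler if s.split()]
    if not filler:
        raise ValueError("filler must contain at least one non-empty sentence")

    # Repeat filler sentences until the next one would exceed the budget.
    chunks, used = [], 0
    for sentence in cycle(filler):
        n = len(sentence.split())
        if used + n > budget_words:
            break
        chunks.append(sentence)
        used += n

    # Insert the task at the chunk index matching the relative position.
    idx = round(position * len(chunks))
    return "\n".join(chunks[:idx] + [task] + chunks[idx:])


filler = ["alpha beta gamma.", "delta epsilon."]
end_prompt = build_prompt("Q: 2+2?", filler, position=1.0, budget_words=10)
mid_prompt = build_prompt("Q: 2+2?", filler, position=0.5, budget_words=10)
```

Holding `filler` and `budget_words` fixed while changing only `position` isolates the positional effect; sweeping `budget_words` then shows how that effect changes with context length.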

Key facts

arXiv paper 2605.23170 audits 11 long-context benchmarks for positional control.
No benchmark jointly controls task position, filler content, and context length for reasoning.
Four flagship long-context releases lack main result-table entries for NIAH, RULER, or LongBench-family benchmarks.
Agentic and coding benchmarks appear in main result tables across all four releases.
Context Rot Evaluation (CRE) is proposed to vary task position, filler content, and context length.
Nine LLMs evaluated on GSM8K and ARC-Challenge in two rounds.
Models drop sharply when target task moves from end to middle of context.
MiMo-v2-Flash drops 88pp at 64K context length.
Drop worsens with longer context for vulnerable models.
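The drops above are reported in percentage points, meaning the absolute difference between two accuracies. A minimal sketch of that arithmetic follows; the function name and the accuracy values are hypothetical placeholders, not figures from the paper.

```python
def positional_drop_pp(acc_end: float, acc_middle: float) -> float:
    """Accuracy drop in percentage points when the task moves from end to middle."""
    return round((acc_end - acc_middle) * 100, 1)


# Hypothetical (end, middle) accuracies per context length, for illustration only.
hypothetical = {8_000: (0.92, 0.85), 64_000: (0.90, 0.40)}
drops = {n: positional_drop_pp(end, mid) for n, (end, mid) in hypothetical.items()}
```

With these placeholder values, a fall from 90% to 40% accuracy is a 50 percentage point drop, which is distinct from a 50% relative drop.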
