LLMs Achieve 100% Single-Needle Retrieval at 1M Tokens but Struggle with Multi-Hop Reasoning
A recent study evaluates five frontier large language models that advertise 1M-token context windows, using a classical Chinese corpus. Test 1 measures single-needle retrieval at 1M tokens: three biographical needles are planted at varying depths, in both genuine and altered variants, to distinguish in-context retrieval from memorized training knowledge. Gemini 3.1 Pro, Claude Opus 4.7, and GPT-5.5 all achieve perfect accuracy. Test 2 measures multi-hop chain traversal at three context lengths (256K, 512K, and 1M tokens) and finds a marked decline in performance as the number of reasoning steps increases.
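The single-needle setup is straightforward to reproduce in outline. Below is a minimal sketch, assuming a character-level approximation of token depth; the filler text, the needle wording (a genuine and an altered biography of Wang Anshi), and the query are illustrative placeholders, not the paper's actual materials.

```python
def plant_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert `needle` at a fractional depth (0.0 = start, 1.0 = end)."""
    pos = int(len(haystack) * depth)
    # Snap to the nearest preceding sentence boundary so the needle does
    # not split a sentence (。 is the classical Chinese full stop).
    boundary = haystack.rfind("。", 0, pos)
    boundary = pos if boundary == -1 else boundary + 1
    return haystack[:boundary] + needle + haystack[boundary:]

# Placeholder filler standing in for the ~1M-token classical Chinese corpus.
haystack = "此處為填充之文。" * 100_000

# Genuine fact vs. altered variant: if the model answers the altered prompt
# with the historically correct birthplace, it is recalling training data
# rather than reading the context.
real_needle    = "王安石，字介甫，撫州臨川人也。"
altered_needle = "王安石，字介甫，杭州錢塘人也。"  # birthplace altered

for depth in (0.1, 0.5, 0.9):  # three insertion depths
    context = plant_needle(haystack, altered_needle, depth)
    # Query the model with the full context, e.g. "王安石何處人也？",
    # and score whether it returns the in-context (altered) answer.
    print(depth, context.find(altered_needle))
```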
Key facts
- Five frontier LLMs with 1M-token context windows were evaluated.
- Test 1: single-needle retrieval at 1M tokens with three biographical needles.
- Needles planted at three depths with real and altered variants.
- Gemini 3.1 Pro, Claude Opus 4.7, and GPT-5.5 achieved 100% accuracy.
- Test 2: three-hop chain traversal across 256K, 512K, and 1M tokens (sketched after this list).
- Multi-hop accuracy decays markedly as reasoning steps accumulate.
- Study uses classical Chinese text corpus.
- Published on arXiv as 2605.02173v1.
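To make the multi-hop test concrete, here is a hypothetical sketch of how a three-hop chain can be embedded in a long context. The entities, the 師事 ("studied under") phrasing, and the filler are invented for illustration and do not come from the paper.

```python
import random

# Hypothetical three-hop teacher chain: answering the final question
# requires traversing every link, so a single lookup cannot succeed.
chain = [("甲", "乙"), ("乙", "丙"), ("丙", "丁")]

facts = [f"{student}師事{teacher}。" for student, teacher in chain]
random.shuffle(facts)  # scatter the hops so they sit far apart

filler = "此處為填充之文。"
segments = []
for fact in facts:
    segments.append(filler * 20_000)  # filler spreads facts across the window
    segments.append(fact)
segments.append(filler * 20_000)
context = "".join(segments)

def traverse(start: str, edges: dict[str, str], hops: int) -> str:
    """Gold-answer traversal: follow the chain `hops` times."""
    node = start
    for _ in range(hops):
        node = edges[node]
    return node

gold = traverse("甲", dict(chain), hops=3)
question = "甲之師之師之師為誰？"  # expected answer: 丁
print(question, "→", gold)
```

The key property is that no single retrieval answers the question: the model must resolve each link in turn, which is where the reported decay sets in.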