LLMs Achieve 100% Single-Needle Retrieval at 1M Tokens but Struggle with Multi-Hop Reasoning
A recent study evaluates five frontier large language models that advertise 1M-token context windows, using a classical Chinese corpus. Test 1 measures single-needle retrieval at 1M tokens: three biographical needles are planted at varying depths, in both genuine and altered variants, to distinguish in-context retrieval from memorized training knowledge. Gemini 3.1 Pro, Claude Opus 4.7, and GPT-5.5 all achieve perfect accuracy. Test 2 measures multi-hop chain traversal at three context lengths (256K, 512K, and 1M tokens) and finds a marked decline in performance as the number of reasoning steps increases.
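The single-needle setup is straightforward to reproduce in outline. Below is a minimal sketch, assuming a character-level approximation of token depth; the filler text, the needle wording (a genuine and an altered biography of Wang Anshi), and the query are illustrative placeholders, not the paper's actual materials.

```python
def plant_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert `needle` at a fractional depth (0.0 = start, 1.0 = end)."""
    pos = int(len(haystack) * depth)
    # Snap to the nearest preceding sentence boundary so the needle does
    # not split a sentence (。 is the classical Chinese full stop).
    boundary = haystack.rfind("。", 0, pos)
    boundary = pos if boundary == -1 else boundary + 1
    return haystack[:boundary] + needle + haystack[boundary:]

# Placeholder filler standing in for the ~1M-token classical Chinese corpus.
haystack = "此處為填充之文。" * 100_000

# Genuine fact vs. altered variant: if the model answers the altered prompt
# with the historically correct birthplace, it is recalling training data
# rather than reading the context.
real_needle    = "王安石，字介甫，撫州臨川人也。"
altered_needle = "王安石，字介甫，杭州錢塘人也。"  # birthplace altered

for depth in (0.1, 0.5, 0.9):  # three insertion depths
    context = plant_needle(haystack, altered_needle, depth)
    # Query the model with the full context, e.g. "王安石何處人也？",
    # and score whether it returns the in-context (altered) answer.
    print(depth, context.find(altered_needle))
```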
Key facts
- Five frontier LLMs with 1M-token context windows were evaluated.
- Test 1: single-needle retrieval at 1M tokens with three biographical needles.
- Needles planted at three depths with real and altered variants.
- Gemini 3.1 Pro, Claude Opus 4.7, and GPT-5.5 achieved 100% accuracy.
- Test 2: three-hop chain traversal across 256K, 512K, and 1M tokens (sketched after this list).
- Multi-hop accuracy decays markedly as reasoning steps accumulate.
- Study uses classical Chinese text corpus.
- Published on arXiv as 2605.02173v1.
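To make the multi-hop test concrete, here is a hypothetical sketch of how a three-hop chain can be embedded in a long context. The entities, the 師事 ("studied under") phrasing, and the filler are invented for illustration and do not come from the paper.

```python
import random

# Hypothetical three-hop teacher chain: answering the final question
# requires traversing every link, so a single lookup cannot succeed.
chain = [("甲", "乙"), ("乙", "丙"), ("丙", "丁")]

facts = [f"{student}師事{teacher}。" for student, teacher in chain]
random.shuffle(facts)  # scatter the hops so they sit far apart

filler = "此處為填充之文。"
segments = []
for fact in facts:
    segments.append(filler * 20_000)  # filler spreads facts across the window
    segments.append(fact)
segments.append(filler * 20_000)
context = "".join(segments)

def traverse(start: str, edges: dict[str, str], hops: int) -> str:
    """Gold-answer traversal: follow the chain `hops` times."""
    node = start
    for _ in range(hops):
        node = edges[node]
    return node

gold = traverse("甲", dict(chain), hops=3)
question = "甲之師之師之師為誰？"  # expected answer: 丁
print(question, "→", gold)
```

The key property is that no single retrieval answers the question: the model must resolve each link in turn, which is where the reported decay sets in.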