Base LLMs Fail at Early Planning Tokens, Study Finds

ai-technology · 2026-05-20

A recent study posted to arXiv (2605.16874) reports that although large reasoning models (LRMs) outperform base LLMs on reasoning benchmarks, the gap is concentrated in a small number of early decision tokens. Measuring token-level distributional disagreement on Qwen3-0.6B, the researchers found that only ~8% of generated tokens account for the salient disagreement between base and reasoning models. These tokens concentrate early in responses, are 17x enriched in planning-related decisions, and coincide with high base-model uncertainty. The results suggest that base models fail mainly at these early planning points, and that improving a handful of decision tokens could strengthen reasoning.
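To make "token-level distributional disagreement" concrete, here is a minimal sketch of one way such a score could be computed. It is not the paper's method: the summary does not say which divergence the authors use, so KL divergence from the reasoning model to the base model is an assumption, as are the function names, the `top_fraction` parameter, and the toy distributions. Base-model uncertainty is approximated by the entropy of its next-token distribution.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Shannon entropy in nats, used here as a proxy for model uncertainty."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def flag_disagreement(reasoning_dists, base_dists, top_fraction=0.08):
    """Score every position and return the indices of the most divergent ones.

    Assumption: disagreement is KL(reasoning || base) at each position, and
    the "salient" tokens are simply the top `top_fraction` by that score.
    """
    scores = [kl_divergence(r, b) for r, b in zip(reasoning_dists, base_dists)]
    k = max(1, round(top_fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k]), scores

# Toy next-token distributions over a 3-word vocabulary (invented numbers).
reasoning = [[0.9, 0.05, 0.05], [0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.1, 0.8]]
base      = [[0.1, 0.45, 0.45], [0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.1, 0.8]]

flagged, scores = flag_disagreement(reasoning, base, top_fraction=0.25)
base_uncertainty = [entropy(b) for b in base]
```

In this toy case the two models differ only at the first position, so that is the one position flagged. In practice the distributions would come from running both models over the same token prefix.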

Key facts

arXiv:2605.16874
Large reasoning models (LRMs) outperform base LLMs on reasoning benchmarks
Base-reasoning gap studied via token-level distributional disagreement
Only ~8% of tokens account for salient disagreement on Qwen3-0.6B
Disagreement tokens concentrate early in responses
Disagreement tokens are 17x enriched in planning-related decisions
Disagreement tokens coincide with high base-model uncertainty
Base models fail mainly at early planning points
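
The "17x enriched" and "concentrate early" findings can be read as simple ratios over token positions. The sketch below shows one plausible reading, not the paper's definition: enrichment is taken to be the planning rate among disagreement tokens divided by the planning rate among all other tokens, and earliness is the mean relative position of the disagreement tokens. The function names and the example indices are hypothetical.

```python
def enrichment(flagged, planning, n_tokens):
    """Planning rate among flagged tokens divided by the rate among the rest.

    Assumption: this ratio is one common way to define enrichment; the
    summary does not state the definition the authors use.
    """
    flagged, planning = set(flagged), set(planning)
    n_rest = n_tokens - len(flagged)
    if not flagged or n_rest <= 0:
        raise ValueError("need at least one flagged and one unflagged token")
    rate_flagged = len(flagged & planning) / len(flagged)
    rate_rest = len(planning - flagged) / n_rest
    if rate_rest == 0:
        return float("inf")
    return rate_flagged / rate_rest

def mean_relative_position(flagged, n_tokens):
    """Average position of flagged tokens, scaled to [0, 1] (0 = first token)."""
    if n_tokens < 2:
        raise ValueError("need at least two tokens")
    return sum(i / (n_tokens - 1) for i in flagged) / len(flagged)

# Invented example: a 100-token response with 8 disagreement tokens at the start.
n_tokens = 100
flagged_tokens = list(range(8))
planning_tokens = [0, 1, 2, 3, 50, 60]  # hypothetical planning annotations

ratio = enrichment(flagged_tokens, planning_tokens, n_tokens)
earliness = mean_relative_position(flagged_tokens, n_tokens)
```

A ratio well above 1 means planning decisions are over-represented among disagreement tokens, and an earliness value near 0 means those tokens sit close to the start of the response.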

Sources

arXiv cs.AI — 2026-05-19