Base LLMs Fail at Early Planning Tokens, Study Finds
A recent study posted to arXiv (2605.16874) reports that large reasoning models (LRMs) outperform base LLMs on reasoning benchmarks, but that the gap is concentrated in a small number of early decision tokens. Measuring token-level distributional disagreement between base and reasoning models on Qwen3-0.6B, the researchers found that only ~8% of generated tokens account for the salient disagreement. These tokens cluster early in responses, are 17x enriched in planning-related decisions, and coincide with high base-model uncertainty. The results suggest that base models fail mainly at these early planning points, and that improving a small number of decision tokens could improve reasoning.
Key facts
- arXiv:2605.16874
- Large reasoning models (LRMs) outperform base LLMs on reasoning benchmarks
- Base-reasoning gap studied via token-level distributional disagreement
- Only ~8% of tokens account for salient disagreement on Qwen3-0.6B
- Disagreement tokens concentrate early in responses
- Disagreement tokens are 17x enriched in planning-related decisions
- Disagreement tokens coincide with high base-model uncertainty
- Base models fail mainly at early planning points
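To make "token-level distributional disagreement" concrete, here is a minimal sketch of one way such a measurement could work. It is not the paper's method: this summary does not say which divergence or threshold the authors use, so the Jensen-Shannon divergence, the 0.1 cutoff, the function names, and the toy three-word vocabulary below are all assumptions chosen for illustration.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (in nats) between two next-token
    distributions over the same vocabulary. Assumed metric, not
    necessarily the one used in the paper."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def entropy(p):
    """Shannon entropy (in nats), used here as a proxy for model uncertainty."""
    return -sum(x * math.log(x) for x in p if x > 0)

def flag_disagreement(base_dists, reasoning_dists, threshold=0.1):
    """Return the positions where the two models' next-token distributions
    diverge by more than `threshold` (a hypothetical cutoff)."""
    return [
        i
        for i, (p, q) in enumerate(zip(base_dists, reasoning_dists))
        if js_divergence(p, q) > threshold
    ]

# Toy next-token distributions at four positions (made-up numbers).
base = [
    [0.40, 0.30, 0.30],  # position 0: base model is unsure
    [0.90, 0.05, 0.05],
    [0.98, 0.01, 0.01],
    [0.80, 0.10, 0.10],
]
reasoning = [
    [0.05, 0.90, 0.05],  # position 0: reasoning model commits to another token
    [0.90, 0.05, 0.05],
    [0.97, 0.02, 0.01],
    [0.70, 0.20, 0.10],
]

flagged = flag_disagreement(base, reasoning)
for i in flagged:
    print(f"position {i}: JS={js_divergence(base[i], reasoning[i]):.3f}, "
          f"base entropy={entropy(base[i]):.3f}")
```

In this toy data only the first position is flagged, and it is also where the base distribution is flattest, which mirrors the reported pattern of early disagreement tokens coinciding with high base-model uncertainty. In practice the distributions would come from running both models on the same prefix and applying a softmax to their logits.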