Entropy Centroids as Intrinsic Rewards for Test-Time Scaling
A new method uses entropy centroids as intrinsic rewards to scale test-time compute for large language models, avoiding the need for external reward models. The approach builds on the observation that high-entropy tokens cluster into consecutive groups during inference, yielding stable signals of model uncertainty. This temporal structure is formalized into segment-level rewards, offering an alternative to confidence- and entropy-based methods whose per-token signals are noisy. The work is published on arXiv under ID 2604.26173.
Key facts
- Method uses entropy centroids as intrinsic rewards
- Avoids external reward models
- High-entropy tokens cluster into consecutive groups
- Provides stable model uncertainty signals
- Formalizes segment-level rewards
- Published on arXiv: 2604.26173
- Related to Grok Heavy and Gemini Deep Think
- Addresses test-time compute scaling
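The summary above does not give the paper's exact formulation, but the core idea can be sketched: compute per-token entropies, group consecutive high-entropy tokens into segments, take each segment's mean entropy as its centroid, and aggregate the centroids into a scalar intrinsic reward for ranking candidate generations. The threshold, the sign convention, and all function names below are illustrative assumptions, not the authors' method.

```python
import numpy as np

def token_entropies(probs):
    """Shannon entropy (nats) of each next-token distribution.
    probs: array of shape (seq_len, vocab_size)."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def high_entropy_segments(entropies, threshold):
    """Group consecutive above-threshold tokens into (start, end) segments,
    reflecting the observation that high-entropy tokens cluster together."""
    segments, start = [], None
    for i, h in enumerate(entropies):
        if h > threshold and start is None:
            start = i                      # segment opens
        elif h <= threshold and start is not None:
            segments.append((start, i))    # segment closes
            start = None
    if start is not None:
        segments.append((start, len(entropies)))
    return segments

def segment_centroid_reward(entropies, threshold=1.0):
    """Hypothetical intrinsic reward: negative mean of the segment
    entropy centroids, so candidates whose uncertain regions are
    less uncertain score higher. Returns 0.0 if no segment exists."""
    segs = high_entropy_segments(entropies, threshold)
    if not segs:
        return 0.0
    centroids = [float(entropies[a:b].mean()) for a, b in segs]
    return -float(np.mean(centroids))
```

Under this sketch, test-time scaling reduces to best-of-N selection without an external reward model: sample N candidate generations, score each with `segment_centroid_reward` on its own token entropies, and keep the argmax.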