Entropy Centroids as Intrinsic Rewards for Test-Time Scaling
A new method uses entropy centroids as intrinsic rewards to scale test-time compute for large language models, avoiding the need for external reward models. The approach builds on the observation that high-entropy tokens cluster into consecutive groups during inference, yielding stable signals of model uncertainty. This temporal structure is formalized into segment-level rewards, offering an alternative to confidence- and entropy-based methods whose per-token signals are noisy. The work is published on arXiv under ID 2604.26173.
Key facts
- Method uses entropy centroids as intrinsic rewards
- Avoids external reward models
- High-entropy tokens cluster into consecutive groups
- Provides stable model uncertainty signals
- Formalizes segment-level rewards
- Published on arXiv: 2604.26173
- Related to Grok Heavy and Gemini Deep Think
- Addresses test-time compute scaling
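The summary above does not give the paper's exact formulation, but the core idea can be sketched: compute per-token entropies, group consecutive high-entropy tokens into segments, take each segment's mean entropy as its centroid, and aggregate the centroids into a scalar intrinsic reward for ranking candidate generations. The threshold, the sign convention, and all function names below are illustrative assumptions, not the authors' method.

```python
import numpy as np

def token_entropies(probs):
    """Shannon entropy (nats) of each next-token distribution.
    probs: array of shape (seq_len, vocab_size)."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def high_entropy_segments(entropies, threshold):
    """Group consecutive above-threshold tokens into (start, end) segments,
    reflecting the observation that high-entropy tokens cluster together."""
    segments, start = [], None
    for i, h in enumerate(entropies):
        if h > threshold and start is None:
            start = i                      # segment opens
        elif h <= threshold and start is not None:
            segments.append((start, i))    # segment closes
            start = None
    if start is not None:
        segments.append((start, len(entropies)))
    return segments

def segment_centroid_reward(entropies, threshold=1.0):
    """Hypothetical intrinsic reward: negative mean of the segment
    entropy centroids, so candidates whose uncertain regions are
    less uncertain score higher. Returns 0.0 if no segment exists."""
    segs = high_entropy_segments(entropies, threshold)
    if not segs:
        return 0.0
    centroids = [float(entropies[a:b].mean()) for a, b in segs]
    return -float(np.mean(centroids))
```

Under this sketch, test-time scaling reduces to best-of-N selection without an external reward model: sample N candidate generations, score each with `segment_centroid_reward` on its own token entropies, and keep the argmax.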