REFT: First-Token Diversification Boosts RLVR Rollout Diversity

ai-technology · 2026-05-28

Researchers identify the first token after the reasoning marker as a critical yet overlooked position for broadening rollout diversity in Reinforcement Learning with Verifiable Rewards (RLVR). The policy's distribution over this first token is sharply peaked yet decoupled from correctness, so spreading samples across more first tokens broadens coverage without altering correctness signals. They introduce REFT (Rollout Exploration with First-Token Diversification), a lightweight method that samples the first token uniformly from the policy's top-N candidates. The approach targets rollout diversity, a central bottleneck in RLVR, which trains reasoning models without labeled trajectories. The paper is available on arXiv under ID 2605.28295.
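The sampling step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `top_n_indices` and `sample_first_token`, the toy logits, and the choice of N are hypothetical, and the sketch assumes that only the first token after the reasoning marker is drawn this way while the rest of the rollout is decoded by the policy as usual.

```python
import random
from typing import List, Optional, Sequence


def top_n_indices(logits: Sequence[float], n: int) -> List[int]:
    """Return the indices of the n largest logits, highest first."""
    if n <= 0:
        raise ValueError("n must be positive")
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:n]


def sample_first_token(
    logits: Sequence[float],
    n: int,
    rng: Optional[random.Random] = None,
) -> int:
    """Pick the first token uniformly among the policy's top-n candidates.

    The policy's probabilities are used only to select the candidate set;
    within that set every token is equally likely (hypothetical helper).
    """
    if rng is None:
        rng = random.Random()
    candidates = top_n_indices(logits, n)
    return rng.choice(candidates)


# Toy first-token logits over a 5-token vocabulary (made-up values).
first_token_logits = [5.0, 1.0, 4.5, -2.0, 3.0]
token_id = sample_first_token(first_token_logits, n=3, rng=random.Random(0))
```

With a sharply peaked distribution, ordinary sampling would return token 0 almost every time; the uniform draw over the top 3 gives tokens 0, 2, and 4 equal weight, which is where the additional rollout diversity would come from under these assumptions.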

Key facts

REFT stands for Rollout Exploration with First-Token Diversification
The method targets the first token after the reasoning marker
First-token distribution is sharply peaked yet correctness-decoupled
REFT samples first tokens uniformly from top-N candidates
RLVR trains reasoning models without labeled trajectories
Rollout diversity is a central bottleneck in RLVR
Existing methods use temperature, prefix, or rollout-selection adjustments
Paper available on arXiv: 2605.28295
