REFT: First-Token Diversification Boosts RLVR Rollout Diversity
Researchers identify the first token after the reasoning marker as a critical yet overlooked position for broadening rollout diversity in Reinforcement Learning with Verifiable Rewards (RLVR). At this position the policy's distribution is sharply peaked yet decoupled from correctness, so the first token can be diversified to broaden coverage without altering correctness signals. They introduce REFT (Rollout Exploration with First-Token Diversification), a lightweight method that samples the first token uniformly from the policy's top-N candidates. The approach addresses a central bottleneck in RLVR, which trains reasoning models without labeled trajectories and therefore depends on diverse rollouts. The paper is available on arXiv under ID 2605.28295.
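The sampling step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names (`top_n_indices`, `sample_first_token`, `diversified_first_tokens`), the use of raw logits as input, and the seeding scheme are all assumptions made for illustration. The sketch only covers choosing the first token; it assumes the rest of each rollout would then be generated by the policy as usual.

```python
import random
from typing import List, Optional, Sequence


def top_n_indices(logits: Sequence[float], n: int) -> List[int]:
    """Return the indices of the n largest logits, highest first."""
    if not 1 <= n <= len(logits):
        raise ValueError("n must be between 1 and the vocabulary size")
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:n]


def sample_first_token(
    logits: Sequence[float], n: int, rng: Optional[random.Random] = None
) -> int:
    """Pick the first token uniformly from the policy's top-n candidates.

    The policy's probabilities are used only to rank candidates; within the
    top-n set every candidate is equally likely (an assumption about how
    "uniformly from the top-N" is realized).
    """
    if rng is None:
        rng = random.Random()
    return rng.choice(top_n_indices(logits, n))


def diversified_first_tokens(
    logits: Sequence[float], n: int, num_rollouts: int, seed: int = 0
) -> List[int]:
    """Draw one first token per rollout for a single prompt."""
    rng = random.Random(seed)
    return [sample_first_token(logits, n, rng) for _ in range(num_rollouts)]


if __name__ == "__main__":
    # Toy first-token logits: sharply peaked on token 0.
    toy_logits = [5.0, 0.1, 4.0, -1.0, 3.0]
    print(diversified_first_tokens(toy_logits, n=3, num_rollouts=8))
```

With `n=1` the sketch reduces to greedy selection of the first token. Larger `n` spreads rollouts across more opening tokens regardless of how peaked the original distribution is.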
Key facts
- REFT stands for Rollout Exploration with First-Token Diversification
- The method targets the first token after the reasoning marker
- First-token distribution is sharply peaked yet correctness-decoupled
- REFT samples first tokens uniformly from top-N candidates
- RLVR trains reasoning models without labeled trajectories
- Rollout diversity is a central bottleneck in RLVR
- Existing methods use temperature, prefix, or rollout-selection adjustments
- Paper available on arXiv: 2605.28295
Entities
Institutions
- arXiv