ARTFEED — Contemporary Art Intelligence

REFT: First-Token Diversification Boosts RLVR Rollout Diversity

ai-technology · 2026-05-28

Researchers identify the first token after the reasoning marker as a critical yet overlooked position for broadening rollout diversity in Reinforcement Learning with Verifiable Rewards (RLVR). The policy's first-token distribution is sharply peaked yet decoupled from correctness, so diversifying it broadens coverage without altering correctness signals. They introduce REFT (Rollout Exploration with First-Token Diversification), a lightweight method that samples the first token uniformly from the policy's top-N candidates. The approach targets a central bottleneck in RLVR, which trains reasoning models without labeled trajectories and therefore depends on diverse rollouts. The paper is available on arXiv under ID 2605.28295.
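As a rough illustration of the sampling step described above, the following sketch picks a first token uniformly from the top-N candidates of a toy logit vector. It is not the paper's code: the function names, the `top_n` and `seed` parameters, and the example logits are hypothetical, and it assumes the rest of each rollout would then be generated by the policy's usual sampling.

```python
import random


def sample_first_token(logits, top_n, rng=random):
    """Return a token id chosen uniformly from the top_n highest-logit tokens.

    Hypothetical helper: `logits` is a list of floats indexed by token id.
    """
    if top_n < 1:
        raise ValueError("top_n must be at least 1")
    # Rank token ids by logit, highest first; ties go to the lower token id.
    ranked = sorted(range(len(logits)), key=lambda i: (-logits[i], i))
    candidates = ranked[:top_n]
    # A uniform choice ignores how peaked the original distribution is.
    return rng.choice(candidates)


def first_tokens_for_rollouts(logits, top_n, num_rollouts, seed=0):
    """Draw one diversified first token per rollout (assumed usage pattern)."""
    rng = random.Random(seed)
    return [sample_first_token(logits, top_n, rng) for _ in range(num_rollouts)]


# Toy logits: token 0 dominates, so ordinary sampling would almost always pick it.
toy_logits = [5.0, 0.1, 4.0, -2.0, 3.5]
first_tokens = first_tokens_for_rollouts(toy_logits, top_n=3, num_rollouts=8)
```

With these toy logits the candidate set is tokens 0, 2, and 4, so each rollout starts from one of three openings instead of nearly always starting from token 0.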

Key facts

  • REFT stands for Rollout Exploration with First-Token Diversification
  • The method targets the first token after the reasoning marker
  • First-token distribution is sharply peaked yet correctness-decoupled
  • REFT samples first tokens uniformly from top-N candidates
  • RLVR trains reasoning models without labeled trajectories
  • Rollout diversity is a central bottleneck in RLVR
  • Existing methods use temperature, prefix, or rollout-selection adjustments
  • Paper available on arXiv: 2605.28295

Entities

Institutions

  • arXiv
