ARTFEED — Contemporary Art Intelligence

Selector-Guided Curriculum Boosts One-Shot RLVR for LLMs

ai-technology · 2026-05-06

A new approach called Selector-Guided Autonomous Curriculum (SGAC) improves one-shot Reinforcement Learning from Verifiable Rewards (RLVR) for Large Language Models (LLMs). Current state-of-the-art methods select training instances with heuristics based on historical reward variance, but the authors argue this signal is a misleading proxy for how well a single instance transfers. SGAC instead employs a learnable selector model over a multi-dimensional feature space covering success probability, reward variance, output disagreement (entropy over sampled answers), and semantic difficulty. In empirical evaluation on pools of candidate problems, output disagreement proved the strongest predictor of downstream reasoning gains, outperforming reward variance.
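To make the selection step concrete, here is a minimal sketch of how a selector could score candidate instances over the four features the article names. The `Candidate` structure, the Bernoulli-variance shortcut, and the linear scoring weights are all illustrative assumptions; the paper describes a learnable selector, not this exact form.

```python
# Hypothetical sketch of selector-guided instance selection for one-shot RLVR.
# The feature set mirrors the article (success probability, reward variance,
# output disagreement, semantic difficulty); everything else is assumed.
import math
from dataclasses import dataclass

@dataclass
class Candidate:
    problem_id: str
    rewards: list          # 0/1 verifiable rewards from sampled rollouts
    answer_counts: dict    # distinct final answers -> occurrence counts
    difficulty: float      # semantic difficulty score in [0, 1]

def features(c: Candidate) -> list:
    n = len(c.rewards)
    p_success = sum(c.rewards) / n            # success probability
    variance = p_success * (1.0 - p_success)  # reward variance (Bernoulli)
    total = sum(c.answer_counts.values())
    # Output disagreement: entropy of the distribution over distinct answers.
    entropy = -sum((k / total) * math.log(k / total)
                   for k in c.answer_counts.values())
    return [p_success, variance, entropy, c.difficulty]

def select(candidates, weights):
    """Score each candidate with a (stand-in) linear selector; pick the best."""
    def score(c):
        return sum(w * f for w, f in zip(weights, features(c)))
    return max(candidates, key=score)

pool = [
    Candidate("easy", rewards=[1, 1, 1, 0],
              answer_counts={"42": 4}, difficulty=0.2),
    Candidate("hard", rewards=[1, 0, 0, 1],
              answer_counts={"7": 2, "9": 1, "11": 1}, difficulty=0.7),
]
# Weights emphasizing disagreement (entropy), per the reported finding.
best = select(pool, weights=[0.1, 0.2, 1.0, 0.3])
print(best.problem_id)  # -> "hard": disagreement dominates the score
```

Note that the "easy" candidate has nonzero reward variance but zero answer entropy (all rollouts agree), so a variance-only heuristic would rate it more highly than an entropy-weighted selector does.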

Key facts

  • SGAC uses a learnable selector model for instance selection in RLVR
  • Current heuristics based on reward variance are misleading
  • Output disagreement is the strongest predictor of reasoning gains
  • Feature space includes success probability, reward variance, entropy, and semantic difficulty
  • Empirical evaluation conducted on pools of candidate problems
  • One-shot RLVR improves LLM math reasoning from a single training instance
  • Paper published on arXiv with ID 2605.01823
  • SGAC stands for Selector-Guided Autonomous Curriculum

Entities

Institutions

  • arXiv
