ARTFEED — Contemporary Art Intelligence

Selector-Guided Curriculum Boosts One-Shot RLVR for LLMs

ai-technology · 2026-05-06

A new approach called Selector-Guided Autonomous Curriculum (SGAC) improves one-shot Reinforcement Learning from Verifiable Rewards (RLVR) for Large Language Models (LLMs). Current state-of-the-art methods select training instances with heuristics based on historical reward variance, but the authors argue this signal is a misleading proxy for how well a single instance transfers. SGAC instead employs a learnable selector model over a multi-dimensional feature space covering success probability, reward variance, output disagreement (entropy over sampled answers), and semantic difficulty. In empirical evaluation on pools of candidate problems, output disagreement proved the strongest predictor of downstream reasoning gains, outperforming reward variance.
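To make the selection step concrete, here is a minimal sketch of how a selector could score candidate instances over the four features the article names. The `Candidate` structure, the Bernoulli-variance shortcut, and the linear scoring weights are all illustrative assumptions; the paper describes a learnable selector, not this exact form.

```python
# Hypothetical sketch of selector-guided instance selection for one-shot RLVR.
# The feature set mirrors the article (success probability, reward variance,
# output disagreement, semantic difficulty); everything else is assumed.
import math
from dataclasses import dataclass

@dataclass
class Candidate:
    problem_id: str
    rewards: list          # 0/1 verifiable rewards from sampled rollouts
    answer_counts: dict    # distinct final answers -> occurrence counts
    difficulty: float      # semantic difficulty score in [0, 1]

def features(c: Candidate) -> list:
    n = len(c.rewards)
    p_success = sum(c.rewards) / n            # success probability
    variance = p_success * (1.0 - p_success)  # reward variance (Bernoulli)
    total = sum(c.answer_counts.values())
    # Output disagreement: entropy of the distribution over distinct answers.
    entropy = -sum((k / total) * math.log(k / total)
                   for k in c.answer_counts.values())
    return [p_success, variance, entropy, c.difficulty]

def select(candidates, weights):
    """Score each candidate with a (stand-in) linear selector; pick the best."""
    def score(c):
        return sum(w * f for w, f in zip(weights, features(c)))
    return max(candidates, key=score)

pool = [
    Candidate("easy", rewards=[1, 1, 1, 0],
              answer_counts={"42": 4}, difficulty=0.2),
    Candidate("hard", rewards=[1, 0, 0, 1],
              answer_counts={"7": 2, "9": 1, "11": 1}, difficulty=0.7),
]
# Weights emphasizing disagreement (entropy), per the reported finding.
best = select(pool, weights=[0.1, 0.2, 1.0, 0.3])
print(best.problem_id)  # -> "hard": disagreement dominates the score
```

Note that the "easy" candidate has nonzero reward variance but zero answer entropy (all rollouts agree), so a variance-only heuristic would rate it more highly than an entropy-weighted selector does.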

Key facts

  • SGAC uses a learnable selector model for instance selection in RLVR
  • Current heuristics based on reward variance are misleading
  • Output disagreement is the strongest predictor of reasoning gains
  • Feature space includes success probability, reward variance, entropy, and semantic difficulty
  • Empirical evaluation conducted on pools of candidate problems
  • One-shot RLVR improves LLM math reasoning from a single training instance
  • Paper published on arXiv with ID 2605.01823
  • SGAC stands for Selector-Guided Autonomous Curriculum

Entities

Institutions

  • arXiv
