VI-CuRL: Verifier-Free RL for LLM Reasoning via Confidence-Guided Variance Reduction

ai-technology · 2026-05-25

Verifier-Independent Curriculum Reinforcement Learning (VI-CuRL) is a new reinforcement learning framework that targets destructive gradient variance in verifier-free training of Large Language Models (LLMs) for reasoning tasks. In this setting, standard methods such as Group Relative Policy Optimization (GRPO) often suffer training collapse because of high variance. VI-CuRL uses the model's intrinsic confidence to build a curriculum that prioritizes high-confidence samples, managing the bias-variance trade-off by reducing both action variance and problem variance. The framework is designed to stabilize training without relying on external verifiers, which is intended to improve scalability. The paper, posted to arXiv (2602.12579v2), provides a rigorous analysis and reports that VI-CuRL improves reasoning capabilities.
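The idea of prioritizing high-confidence samples can be illustrated with a minimal sketch. This is not the paper's algorithm: the summary above does not say how confidence is computed or how the curriculum is scheduled. The sketch assumes confidence is the geometric-mean token probability of a sampled response, and the names `sequence_confidence`, `select_high_confidence`, and `keep_fraction` are hypothetical.

```python
import math
from typing import List, Sequence


def sequence_confidence(token_logprobs: Sequence[float]) -> float:
    """Assumed confidence proxy: geometric-mean token probability, in (0, 1]."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def select_high_confidence(confidences: Sequence[float],
                           keep_fraction: float) -> List[int]:
    """Return indices of the most confident samples, in original order.

    keep_fraction is a hypothetical curriculum knob: the share of the batch
    that is kept for the policy update at the current training stage.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = max(1, math.ceil(keep_fraction * len(confidences)))
    # Rank sample indices from most to least confident.
    ranked = sorted(range(len(confidences)),
                    key=lambda i: confidences[i], reverse=True)
    return sorted(ranked[:n_keep])


# Made-up per-token log-probabilities for four sampled responses.
batch_logprobs = [
    [-0.1, -0.2],
    [-1.5, -2.0],
    [-0.05, -0.05],
    [-0.9, -0.7],
]
confidences = [sequence_confidence(lp) for lp in batch_logprobs]
kept = select_high_confidence(confidences, keep_fraction=0.5)
print(kept)
```

Under these assumptions, a curriculum would start with a small `keep_fraction` and raise it over training, so that early updates use fewer, more confident samples (lower variance, more bias) and later updates use more of the batch.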

Key facts

VI-CuRL is a verifier-independent curriculum reinforcement learning framework.
It addresses destructive gradient variance in verifier-free LLM reasoning training.
Standard GRPO often leads to training collapse in verifier-free settings.
VI-CuRL uses the model's intrinsic confidence to prioritize high-confidence samples.
The framework reduces action and problem variance.
It manages the bias-variance trade-off without external verifiers.
The paper is published on arXiv with ID 2602.12579v2.
VI-CuRL aims to enhance the scalability of reinforcement learning with verifiable rewards (RLVR) for LLMs.
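For readers unfamiliar with GRPO, its group-relative advantage can be sketched as follows: each response sampled for the same prompt is scored against the mean and standard deviation of its own group. The reward values below are made up, the use of the population standard deviation and the `eps` constant are assumptions (implementations differ), and the summary does not say which reward signal replaces the verifier in the verifier-free setting.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption in this sketch
    # eps guards against division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four responses to one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Responses that score above the group mean receive positive advantages and those below receive negative ones. When every reward in a group is identical, all advantages are zero and the group contributes no learning signal.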
