ARTFEED — Contemporary Art Intelligence

VI-CuRL: Verifier-Free RL for LLM Reasoning via Confidence-Guided Variance Reduction

ai-technology · 2026-05-25

Verifier-Independent Curriculum Reinforcement Learning (VI-CuRL) is a new reinforcement learning framework that targets destructive gradient variance in verifier-free training of Large Language Models (LLMs) for reasoning tasks. In this setting, standard methods such as Group Relative Policy Optimization (GRPO) often suffer training collapse because of high variance. VI-CuRL uses the model's intrinsic confidence to build a curriculum that prioritizes high-confidence samples, which reduces both action variance and problem variance and lets the method manage the bias-variance trade-off. Because it stabilizes training without relying on external verifiers, the framework is intended to scale more easily. The paper, posted on arXiv (2602.12579v2), provides a rigorous analysis and reports that VI-CuRL improves reasoning capabilities.
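The confidence-guided curriculum described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the summary does not say how confidence is measured, how the curriculum is scheduled, or where the verifier-free reward comes from. Here confidence is assumed to be the geometric-mean token probability of a response, the schedule is assumed to be linear, and the function names (`sequence_confidence`, `retain_fraction`, `select_confident_prompts`) are hypothetical. Only the group-relative standardization follows the usual GRPO recipe.

```python
import math
from typing import List, Sequence


def sequence_confidence(token_logprobs: Sequence[float]) -> float:
    """Assumed confidence score: geometric-mean token probability, in (0, 1]."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def retain_fraction(step: int, total_steps: int, start: float = 0.25) -> float:
    """Assumed linear schedule: keep `start` of the prompts at step 0, all of them at the end."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (1.0 - start) * progress


def select_confident_prompts(confidences: Sequence[float], fraction: float) -> List[int]:
    """Return the indices of the most confident prompts, always keeping at least one."""
    keep = max(1, math.ceil(fraction * len(confidences)))
    ranked = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    return sorted(ranked[:keep])


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: standardize rewards within one prompt's group of samples."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Per-prompt confidence, e.g. averaged over that prompt's sampled responses.
    confidences = [0.9, 0.2, 0.6, 0.4]
    # Rewards for each prompt's group; their verifier-free source is left unspecified here.
    rewards = [[1.0, 0.0], [0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

    kept = select_confident_prompts(confidences, retain_fraction(step=0, total_steps=100))
    for i in kept:
        print(i, group_relative_advantages(rewards[i]))
```

Early in training only the most confident prompts contribute to the policy update, and the retained fraction grows until the full batch is used. Filtering at the prompt level is itself a design assumption; the same idea could be applied to individual responses.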

Key facts

  • VI-CuRL is a verifier-independent curriculum reinforcement learning framework.
  • It addresses destructive gradient variance in verifier-free LLM reasoning training.
  • Standard GRPO often leads to training collapse in verifier-free settings.
  • VI-CuRL uses the model's intrinsic confidence to prioritize high-confidence samples.
  • The framework reduces action and problem variance.
  • It manages the bias-variance trade-off without external verifiers.
  • The paper is available on arXiv as 2602.12579v2.
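  • By dropping the external verifier that Reinforcement Learning with Verifiable Rewards (RLVR) depends on, VI-CuRL aims to make RL training of LLMs more scalable.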
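
One standard way to read the "action and problem variance" in the key facts is through the law of total variance. The notation below is an assumption made for illustration and is not taken from the paper: x is a prompt, y is a response sampled from the policy for that prompt, and g(x, y) is one scalar component of the per-sample gradient estimate.

```latex
\operatorname{Var}_{x,y}\bigl[g(x,y)\bigr]
  = \underbrace{\mathbb{E}_{x}\Bigl[\operatorname{Var}_{y\mid x}\bigl[g(x,y)\bigr]\Bigr]}_{\text{within-prompt (action) variance}}
  + \underbrace{\operatorname{Var}_{x}\Bigl[\mathbb{E}_{y\mid x}\bigl[g(x,y)\bigr]\Bigr]}_{\text{across-prompt (problem) variance}}
```

The identity itself is exact; mapping its two terms onto the paper's "action" and "problem" variance is an interpretation. Under that reading, restricting updates to high-confidence prompts changes the distribution over x, which can shrink both terms while introducing bias relative to the full prompt distribution. That is the bias-variance trade-off the summary refers to.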

Entities

Institutions

  • arXiv
