JURY-RL: Label-Free RLVR Framework Decouples Voting from Formal Verification
JURY-RL is a newly introduced framework that tackles false positives in label-free reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs). Traditional RLVR methods depend on human-annotated answers or curated reward specifications, which are expensive to produce. Label-free alternatives such as majority voting or LLM-as-judge scoring eliminate annotation costs, but they risk rewarding false positives that destabilize training. JURY-RL decouples answer proposal from reward assignment: model rollouts vote to propose a candidate answer, and a formal verifier decides whether that answer earns a positive reward. Only rollouts that match the majority-voted answer are rewarded, and only when the answer verifies in Lean. When verification is inconclusive, a fallback reward called ResZero (Residual-Zero) discards the unverified majority proposal and redistributes a zero-mean, variance-preserving signal. The approach aims to stabilize training in machine-checkable domains without human annotations.
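The vote-then-verify decoupling described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `verify` stands in for a formal (e.g. Lean) checker, and the exact reward values are assumptions.

```python
from collections import Counter

def jury_reward(answers, verify):
    """Sketch of JURY-RL-style reward assignment.

    answers: final answers extracted from a batch of model rollouts.
    verify:  callable(answer) -> True / False / None (None = inconclusive);
             a stand-in for a formal verifier such as Lean (assumption).
    """
    # Voting proposes a candidate answer (majority vote over rollouts).
    candidate, _ = Counter(answers).most_common(1)[0]
    verdict = verify(candidate)
    if verdict is True:
        # Positive reward only for rollouts matching the verified candidate.
        return [1.0 if a == candidate else 0.0 for a in answers]
    if verdict is False:
        return [0.0] * len(answers)
    # Inconclusive: placeholder only. The paper's ResZero fallback would
    # instead redistribute a zero-mean, variance-preserving signal here.
    return [0.0] * len(answers)
```

Usage: `jury_reward(["4", "5", "4"], lambda a: True)` rewards the two rollouts that match the verified majority answer `"4"`.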
Key facts
- JURY-RL is a label-free RLVR framework for LLMs.
- It decouples answer proposal from reward assignment.
- Votes from model rollouts propose a candidate answer.
- A formal verifier determines if the candidate receives positive reward.
- Only rollouts matching the majority-voted answer are rewarded, and only when that answer is verified in Lean.
- The ResZero fallback reward discards unverified majority proposals.
- ResZero redistributes a zero-mean, variance-preserving signal.
- The framework addresses false positives from label-free methods.
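One plausible reading of the ResZero fallback, sketched below, is to center a per-rollout score to zero mean and rescale it to a target spread. Both the source of `scores` and the `target_std` parameter are assumptions for illustration; the paper's exact construction may differ.

```python
def res_zero(scores, target_std=0.5):
    """Sketch of a zero-mean, variance-preserving fallback signal.

    scores:     per-rollout scalar scores (their origin is an assumption).
    target_std: standard deviation of the emitted signal, e.g. chosen to
                match a typical binary verified reward (hypothetical).
    """
    n = len(scores)
    mean = sum(scores) / n
    centered = [s - mean for s in scores]          # zero-mean shift
    std = (sum(c * c for c in centered) / n) ** 0.5
    if std == 0.0:
        return [0.0] * n  # degenerate case: all scores equal
    # Rescale so the emitted signal keeps a fixed, nonzero variance.
    return [c * (target_std / std) for c in centered]
```

Because the signal sums to zero by construction, it never endorses the discarded majority proposal, while the fixed variance keeps a usable gradient signal during training.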
Entities
Institutions
- arXiv