JURY-RL: Label-Free RLVR Framework Decouples Voting from Formal Verification
JURY-RL is a newly introduced framework that tackles false positives in label-free reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs). Traditional RLVR methods depend on human-annotated answers or curated reward specifications, which are expensive to produce. Label-free alternatives such as majority voting or LLM-as-judge scoring eliminate annotation costs, but they risk rewarding false positives that destabilize training. JURY-RL decouples answer proposal from reward assignment: model rollouts vote to propose a candidate answer, and a formal verifier decides whether that answer earns a positive reward. Only rollouts that match the majority-voted answer are rewarded, and only when the answer verifies in Lean. When verification is inconclusive, a fallback reward called ResZero (Residual-Zero) discards the unverified majority proposal and redistributes a zero-mean, variance-preserving signal. The approach aims to stabilize training in machine-checkable domains without human annotations.
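The vote-then-verify decoupling described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `verify` stands in for a formal (e.g. Lean) checker, and the exact reward values are assumptions.

```python
from collections import Counter

def jury_reward(answers, verify):
    """Sketch of JURY-RL-style reward assignment.

    answers: final answers extracted from a batch of model rollouts.
    verify:  callable(answer) -> True / False / None (None = inconclusive);
             a stand-in for a formal verifier such as Lean (assumption).
    """
    # Voting proposes a candidate answer (majority vote over rollouts).
    candidate, _ = Counter(answers).most_common(1)[0]
    verdict = verify(candidate)
    if verdict is True:
        # Positive reward only for rollouts matching the verified candidate.
        return [1.0 if a == candidate else 0.0 for a in answers]
    if verdict is False:
        return [0.0] * len(answers)
    # Inconclusive: placeholder only. The paper's ResZero fallback would
    # instead redistribute a zero-mean, variance-preserving signal here.
    return [0.0] * len(answers)
```

Usage: `jury_reward(["4", "5", "4"], lambda a: True)` rewards the two rollouts that match the verified majority answer `"4"`.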
Key facts
- JURY-RL is a label-free RLVR framework for LLMs.
- It decouples answer proposal from reward assignment.
- Votes from model rollouts propose a candidate answer.
- A formal verifier determines if the candidate receives positive reward.
- Only rollouts matching the majority-voted answer are rewarded, and only when that answer is verified in Lean.
- The ResZero fallback reward discards unverified majority proposals.
- ResZero redistributes a zero-mean, variance-preserving signal.
- The framework addresses false positives from label-free methods.
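One plausible reading of the ResZero fallback, sketched below, is to center a per-rollout score to zero mean and rescale it to a target spread. Both the source of `scores` and the `target_std` parameter are assumptions for illustration; the paper's exact construction may differ.

```python
def res_zero(scores, target_std=0.5):
    """Sketch of a zero-mean, variance-preserving fallback signal.

    scores:     per-rollout scalar scores (their origin is an assumption).
    target_std: standard deviation of the emitted signal, e.g. chosen to
                match a typical binary verified reward (hypothetical).
    """
    n = len(scores)
    mean = sum(scores) / n
    centered = [s - mean for s in scores]          # zero-mean shift
    std = (sum(c * c for c in centered) / n) ** 0.5
    if std == 0.0:
        return [0.0] * n  # degenerate case: all scores equal
    # Rescale so the emitted signal keeps a fixed, nonzero variance.
    return [c * (target_std / std) for c in centered]
```

Because the signal sums to zero by construction, it never endorses the discarded majority proposal, while the fixed variance keeps a usable gradient signal during training.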
Entities
Institutions
- arXiv