Reinforcement Learning with Noisy Verifiers
A recent arXiv paper (2510.00915) addresses unreliable verifiers in Reinforcement Learning with Verifiable Rewards (RLVR). RLVR replaces expensive human labeling with automated verifiers, and rewards are binarized to {0,1} to reduce verifier hacking. Imperfect verifiers nevertheless produce both false positives and false negatives. The authors model verifier unreliability as a stochastic reward channel with asymmetric noise rates: ρ0 for false positives and ρ1 for false negatives. They propose two simple corrections: a backward correction that yields an unbiased surrogate reward and therefore an unbiased policy-gradient estimator, and a forward correction that reweights score-function terms so that the expected update aligns with the clean gradient direction, using only the false-negative rate. Both corrections are implemented as lightweight hooks in group relative policy optimization (GRPO).
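The backward correction can be made concrete with the standard noisy-label argument. The derivation below is a sketch, not a quotation from the paper: it assumes the channel is defined by P(r̃ = 1 | r = 0) = ρ0 and P(r̃ = 0 | r = 1) = ρ1 with ρ0 + ρ1 < 1, and the symbols r (clean reward), r̃ (observed reward), and r̂ (surrogate reward) are notation introduced here for illustration.

```latex
\begin{align*}
\mathbb{E}[\tilde r \mid r]
  &= (1-\rho_1)\,r + \rho_0\,(1-r)
   = \rho_0 + (1-\rho_0-\rho_1)\,r, \\
\hat r
  &= \frac{\tilde r - \rho_0}{1-\rho_0-\rho_1}
  \quad\Longrightarrow\quad
  \mathbb{E}[\hat r \mid r] = r .
\end{align*}
```

Under these assumptions the surrogate r̂ matches the clean reward in expectation, which is the property that would make a policy-gradient estimator built on r̂ unbiased.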
Key facts
- arXiv:2510.00915v4
- RLVR replaces human labeling with automated verifiers
- Rewards are binarized to {0,1} to reduce verifier hacking
- Imperfect verifiers cause false negatives and false positives
- Verifier unreliability is formalized as a stochastic reward channel with asymmetric noise rates ρ0 (false positives) and ρ1 (false negatives)
- Backward correction yields unbiased surrogate reward and policy-gradient estimator
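- Forward correction reweights score-function terms so the expected update aligns with the clean gradient direction, using only the false-negative rate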
- Implemented as lightweight hooks in group relative policy optimization
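The reward channel and the backward correction listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's implementation: it assumes ρ0 = P(observed 1 | clean 0), ρ1 = P(observed 0 | clean 1), and ρ0 + ρ1 < 1, and it uses the standard unbiased surrogate (observed − ρ0) / (1 − ρ0 − ρ1). The function names `backward_correct` and `noisy_verifier` are hypothetical.

```python
import random
import statistics


def backward_correct(noisy_reward, rho0, rho1):
    """Return a surrogate reward whose expectation equals the clean reward.

    Assumption: rho0 = P(observed 1 | clean 0) and
    rho1 = P(observed 0 | clean 1), with rho0 + rho1 < 1.
    """
    denom = 1.0 - rho0 - rho1
    if denom <= 0.0:
        raise ValueError("backward correction requires rho0 + rho1 < 1")
    return (noisy_reward - rho0) / denom


def noisy_verifier(clean_reward, rho0, rho1, rng):
    """Simulate the stochastic reward channel (hypothetical helper)."""
    if clean_reward == 1:
        # A correct answer is rejected with probability rho1 (false negative).
        return 0 if rng.random() < rho1 else 1
    # An incorrect answer is accepted with probability rho0 (false positive).
    return 1 if rng.random() < rho0 else 0


if __name__ == "__main__":
    rng = random.Random(0)
    rho0, rho1 = 0.1, 0.3
    for clean in (0, 1):
        corrected = [
            backward_correct(noisy_verifier(clean, rho0, rho1, rng), rho0, rho1)
            for _ in range(100_000)
        ]
        # The sample mean should be close to the clean reward.
        print(clean, statistics.fmean(corrected))
```

In a training loop, a hook of this kind would be applied to the verifier's rewards before advantages are computed. Note that individual surrogate rewards fall outside [0, 1]; only their expectation matches the clean reward.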
Entities
Institutions
- arXiv