Reinforcement Learning with Verifiable Rewards Enhanced by Rare-Event Amplification
A recent arXiv paper proposes a method to improve reinforcement learning with verifiable rewards (RLVR) for training large language models on deterministic reasoning tasks. The authors argue that effective prompt selection should provide both reliable positive anchors and explicit negative learning signals from rare failures. They introduce positive-negative pairing, which samples a hard-but-solvable prompt together with an easy-but-brittle one, and Weighted GRPO, which reweights binary outcomes so that rare events carry a stronger learning signal. The approach aims to stabilize optimization and improve transfer performance.
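The reweighting idea can be sketched as follows. This is an illustrative reading, not the paper's exact algorithm: `weighted_grpo_advantages`, `pos_weight`, and `neg_weight` are assumed names, and the specific weighting form is an assumption layered on the standard GRPO group-mean baseline.

```python
import numpy as np

def weighted_grpo_advantages(rewards, pos_weight=1.0, neg_weight=1.0):
    """Group-relative advantages for binary (0/1) verifiable rewards,
    with separate weights on positive and negative outcomes.
    Illustrative sketch; the paper's exact scheme may differ."""
    r = np.asarray(rewards, dtype=float)
    # Standard GRPO-style baseline: subtract the mean reward of the
    # group of rollouts sampled for the same prompt.
    adv = r - r.mean()
    # Assumed rare-event amplification: upweight one side of the binary
    # outcome so a rare success (or rare failure) contributes a larger
    # advantage, instead of being washed out by the majority outcome.
    w = np.where(r > 0, pos_weight, neg_weight)
    return w * adv

# Hard-but-solvable prompt: mostly failures, one rare success.
hard = [0, 0, 0, 0, 0, 0, 0, 1]
# Easy-but-brittle prompt: mostly successes, one rare failure.
easy = [1, 1, 1, 1, 1, 1, 1, 0]

# Amplify the rare side of each outcome distribution.
print(weighted_grpo_advantages(hard, pos_weight=4.0, neg_weight=1.0))
print(weighted_grpo_advantages(easy, pos_weight=1.0, neg_weight=4.0))
```

With these weights, the single rare success on the hard prompt gets advantage 4 × (1 − 1/8) = 3.5 rather than 0.875, while the common failures keep their small baseline-corrected value.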
Key facts
- Paper title: Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing
- arXiv ID: 2602.03452
- Announce type: replace-cross
- Focuses on reinforcement learning with verifiable rewards (RLVR)
- Proposes positive-negative pairing for prompt selection
- Introduces Weighted GRPO algorithm
- Aims to improve training stability and transfer
- Addresses limitations of variance-based prompt selection