Reinforcement Learning with Verifiable Rewards Enhanced by Rare-Event Amplification
A recent arXiv paper proposes a method to improve reinforcement learning with verifiable rewards (RLVR) for training large language models on deterministic reasoning tasks. The authors argue that effective prompt selection should provide both reliable positive anchors and explicit negative learning signals from rare failures. They introduce positive-negative pairing, which samples a hard-but-solvable prompt together with an easy-but-brittle one, and Weighted GRPO, which reweights binary outcomes so that rare events carry a stronger learning signal. The approach aims to stabilize optimization and improve transfer performance.
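The reweighting idea can be sketched as follows. This is an illustrative reading, not the paper's exact algorithm: `weighted_grpo_advantages`, `pos_weight`, and `neg_weight` are assumed names, and the specific weighting form is an assumption layered on the standard GRPO group-mean baseline.

```python
import numpy as np

def weighted_grpo_advantages(rewards, pos_weight=1.0, neg_weight=1.0):
    """Group-relative advantages for binary (0/1) verifiable rewards,
    with separate weights on positive and negative outcomes.
    Illustrative sketch; the paper's exact scheme may differ."""
    r = np.asarray(rewards, dtype=float)
    # Standard GRPO-style baseline: subtract the mean reward of the
    # group of rollouts sampled for the same prompt.
    adv = r - r.mean()
    # Assumed rare-event amplification: upweight one side of the binary
    # outcome so a rare success (or rare failure) contributes a larger
    # advantage, instead of being washed out by the majority outcome.
    w = np.where(r > 0, pos_weight, neg_weight)
    return w * adv

# Hard-but-solvable prompt: mostly failures, one rare success.
hard = [0, 0, 0, 0, 0, 0, 0, 1]
# Easy-but-brittle prompt: mostly successes, one rare failure.
easy = [1, 1, 1, 1, 1, 1, 1, 0]

# Amplify the rare side of each outcome distribution.
print(weighted_grpo_advantages(hard, pos_weight=4.0, neg_weight=1.0))
print(weighted_grpo_advantages(easy, pos_weight=1.0, neg_weight=4.0))
```

With these weights, the single rare success on the hard prompt gets advantage 4 × (1 − 1/8) = 3.5 rather than 0.875, while the common failures keep their small baseline-corrected value.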
Key facts
- Paper title: Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing
- arXiv ID: 2602.03452
- Announce type: replace-cross
- Focuses on reinforcement learning with verifiable rewards (RLVR)
- Proposes positive-negative pairing for prompt selection
- Introduces Weighted GRPO algorithm
- Aims to improve training stability and transfer
- Addresses limitations of variance-based prompt selection