RLVR Boosts Pass@1 but Not Pass@k in LLMs
A recent arXiv preprint (2605.18864) asks whether reinforcement learning with verifiable rewards (RLVR) lets large language models acquire new reasoning skills or merely improves their sampling efficiency. The researchers found that RLVR consistently raises pass@1 (the chance that a single sampled answer is correct) on reasoning tasks but yields no comparable gain in pass@k (the chance that at least one of k sampled answers is correct), which points to a lack of exploration. They identify reverse-KL regularization as a key structural limitation: it keeps the policy close to the reference distribution and so discourages alternative reasoning approaches. Neither removing the KL term nor replacing it with forward-KL effectively resolves the issue.
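To make the two metrics concrete, the following is a minimal sketch of the widely used unbiased pass@k estimator, assuming n samples are drawn per problem and c of them are correct. The function names and the per-problem counts are hypothetical, chosen only to show how pass@1 and pass@k can diverge; they are not results from the study.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    size-k subset of the n samples contains at least one correct answer.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given correct-sample counts per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts of correct answers out of n = 16 samples on two problems.
base_counts = [4, 4]    # made-up base model: occasionally right on both problems
tuned_counts = [16, 0]  # made-up tuned model: always right on one, never on the other

for k in (1, 8):
    base = mean_pass_at_k(base_counts, 16, k)
    tuned = mean_pass_at_k(tuned_counts, 16, k)
    print(f"pass@{k}: base={base:.3f} tuned={tuned:.3f}")
```

With these made-up counts, mean pass@1 rises from 0.25 to 0.5, while pass@8 does not improve (about 0.96 for the base counts versus 0.5 for the tuned ones): extra samples cannot rescue a problem the tuned model never solves.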
Key facts
- arXiv:2605.18864
- RLVR improves pass@1 but not pass@k
- Reverse-KL regularization anchors policy to reference distribution
- Neither removing KL nor forward-KL solves the issue
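As a toy illustration of the anchoring effect, the sketch below computes reverse KL, KL(policy || reference), and forward KL, KL(reference || policy), over three hypothetical "reasoning strategies". All distributions and names here are assumptions made up for illustration; this is not the paper's analysis or training objective.

```python
from math import log


def kl(p, q):
    """KL(p || q) in nats for discrete distributions.

    Assumes q is strictly positive wherever p is positive.
    """
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical probabilities the reference model assigns to three strategies.
ref = [0.90, 0.09, 0.01]

# Made-up policy that sharpens the already dominant strategy.
sharpened = [0.98, 0.015, 0.005]

# Made-up policy that shifts mass to the strategy the reference rarely uses.
exploring = [0.50, 0.09, 0.41]

# Reverse KL puts the policy first: KL(policy || reference).
reverse_sharpened = kl(sharpened, ref)
reverse_exploring = kl(exploring, ref)

# Forward KL puts the reference first: KL(reference || policy).
forward_sharpened = kl(ref, sharpened)
forward_exploring = kl(ref, exploring)

print(f"reverse KL: sharpened={reverse_sharpened:.3f} exploring={reverse_exploring:.3f}")
print(f"forward KL: sharpened={forward_sharpened:.3f} exploring={forward_exploring:.3f}")
```

In this toy setup the reverse-KL penalty is about 0.05 nats for the sharpened policy and about 1.2 nats for the exploring one, so a reverse-KL term makes concentrating on the reference's preferred strategy far cheaper than moving mass to a strategy the reference rarely samples.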
Entities
Institutions
- arXiv