RLVR Boosts Pass@1 but Not Pass@k in LLMs

ai-technology · 2026-05-20

A recent preprint posted to arXiv (2605.18864) asks whether reinforcement learning with verifiable rewards (RLVR) lets large language models develop new reasoning skills or merely improves sampling efficiency. The researchers found that RLVR consistently raises pass@1 (the chance that a single sampled answer is correct) on reasoning tasks but does not produce comparable gains in pass@k (the chance that at least one of k sampled answers is correct), which points to a lack of exploration. They identify reverse-KL regularization as a key structural limitation: it keeps the policy anchored to the reference distribution and so discourages alternative reasoning approaches. Neither removing the KL term nor replacing it with forward-KL effectively resolves the issue.
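To make the two metrics concrete, the sketch below computes pass@k with the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The function names and the per-problem counts are invented for illustration and are not taken from the paper; they are chosen only to show how pass@1 can rise while pass@k stays flat.

```python
from math import comb
from statistics import mean


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no correct answer.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given each problem's correct count."""
    return mean(pass_at_k(n, c, k) for c in correct_counts)


# Hypothetical correct counts out of N samples for four problems.
# These numbers are made up for illustration, not results from the paper.
N = 16
base_counts = [2, 1, 3, 0]      # base model: rarely right, but covers 3 problems
tuned_counts = [14, 12, 15, 0]  # tuned model: usually right on the same 3 problems

for k in (1, 16):
    base = benchmark_pass_at_k(base_counts, N, k)
    tuned = benchmark_pass_at_k(tuned_counts, N, k)
    print(f"pass@{k}: base={base:.3f} tuned={tuned:.3f}")
```

In this toy setup the tuned model is far more reliable on a single try, yet with 16 tries both models solve the same three of the four problems, so the large-k score does not move.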

Key facts

arXiv:2605.18864
RLVR improves pass@1 but not pass@k
Reverse-KL regularization anchors policy to reference distribution
Neither removing the KL term nor switching to forward-KL solves the issue
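As a rough illustration of the anchoring effect, the sketch below compares the two KL directions on a toy distribution over three reasoning strategies. Reverse-KL here means KL(policy || reference), the form commonly used as the penalty in RL fine-tuning; forward-KL is KL(reference || policy). All distributions are hypothetical and chosen for illustration; none come from the paper.

```python
import math


def kl(p, q):
    """KL(p || q) in nats for discrete distributions given as equal-length lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue          # 0 * log(0 / q) is taken as 0
        if qi == 0.0:
            return math.inf   # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


# Hypothetical probabilities over three reasoning strategies.
reference = [0.90, 0.09, 0.01]     # reference model strongly prefers strategy 0
sharpened = [0.98, 0.015, 0.005]   # policy that concentrates further on strategy 0
exploratory = [0.40, 0.30, 0.30]   # policy that shifts mass to rare strategies

for name, policy in (("sharpened", sharpened), ("exploratory", exploratory)):
    reverse = kl(policy, reference)   # KL(policy || reference)
    forward = kl(reference, policy)   # KL(reference || policy)
    print(f"{name}: reverse-KL={reverse:.4f} forward-KL={forward:.4f}")
```

In this toy case the exploratory policy pays a much larger reverse-KL penalty than the sharpened one, because it places mass on strategies the reference considers unlikely. This is only a sketch of the general mechanism, not a reproduction of the paper's analysis.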

Entities

Institutions

arXiv

Sources

arXiv cs.AI — 2026-05-20