One-Way Policy Optimization Enhances LLM Reasoning with Verifiable Rewards
A new method called One-Way Policy Optimization (OWPO) addresses inefficiencies in Reinforcement Learning with Verifiable Rewards (RLVR) for Large Language Models (LLMs). RLVR uses binary verifier rewards to scale reasoning capabilities, but the sparsity of those rewards makes training inefficient and unstable. Existing methods impose token-level constraints relative to a reference policy. These constraints penalize deviations indiscriminately, and when the policy tries to outperform the reference they can flip the verifier-determined update direction and suppress gains. OWPO decouples optimization direction from update magnitude: the verifier dictates the direction, while the reference policy only adjusts the magnitude. It applies asymmetric reweighting, performing Accelerated Alignment for inferior deviations. The paper is available on arXiv under ID 2605.22156.
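To make the direction-versus-magnitude distinction concrete, here is a minimal Python sketch. It is not the paper's formulation: this summary does not give OWPO's equations, so the function names, the additive-penalty form, the `beta` coefficient, and the specific weighting rule are all assumptions chosen for illustration. The first function shows how an additive token-level penalty can change the sign of a verifier-derived signal. The second keeps the verifier's sign and lets the reference policy change only a strictly positive weight, which is made larger for what the sketch treats as an inferior deviation.

```python
def penalized_signal(advantage, logp_policy, logp_ref, beta=0.5):
    """Additive token-level penalty relative to a reference policy.

    Illustrative form only (an assumption, not taken from the paper):
    the penalty is subtracted from the verifier-derived advantage,
    so a large enough gap can change the sign of the result.
    """
    gap = logp_policy - logp_ref
    return advantage - beta * gap


def one_way_signal(advantage, logp_policy, logp_ref, beta=0.5):
    """Hypothetical one-way reweighting (NOT the paper's formula).

    The sign always comes from the verifier-derived advantage. The
    reference policy contributes only a strictly positive weight.
    """
    gap = logp_policy - logp_ref
    # Assumed definition of an "inferior deviation": the policy is below
    # the reference on a rewarded token, or above it on a penalized one.
    inferior = advantage * gap < 0
    # Assumed asymmetry: inferior deviations get a weight above 1;
    # every other case is left unscaled in this sketch.
    weight = 1.0 + beta * abs(gap) if inferior else 1.0
    return advantage * weight


if __name__ == "__main__":
    # Rewarded token, policy already more confident than the reference.
    print(penalized_signal(1.0, -0.5, -3.0))  # sign flips to negative
    print(one_way_signal(1.0, -0.5, -3.0))    # sign stays positive
    # Rewarded token, policy less confident than the reference.
    print(one_way_signal(1.0, -3.0, -0.5))    # same sign, larger magnitude
```

Because the weight in `one_way_signal` is always positive, the reference policy can never reverse the verifier's verdict in this sketch. How the actual method treats deviations that are not inferior is not stated in this summary, so the unscaled branch is a placeholder.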
Key facts
- OWPO is a new method for RLVR in LLMs.
- RLVR uses binary verifier rewards to scale reasoning.
- Existing token-level constraints can flip the verifier-determined direction.
- OWPO decouples optimization direction from update magnitude.
- The verifier dictates update direction in OWPO.
- The reference policy adjusts magnitude in OWPO.
- OWPO applies asymmetric reweighting.
- The paper is on arXiv: 2605.22156.
Entities
Institutions
- arXiv