R²VPO: A New Reinforcement Learning Method Without Clipping
Ratio-Variance Regularized Policy Optimization (R²VPO) is a new reinforcement learning method that eliminates the need for heuristic clipping in on-policy algorithms. Standard on-policy RL clips the policy ratio to enforce a trust region, but clipping indiscriminately truncates updates that are high-return yet high-divergence. R²VPO instead constrains the variance of the policy ratio, a principled local approximation to the trust-region constraint. This constraint acts as a distributional soft brake: it preserves gradient signals from novel discoveries and enables reuse of stale, off-policy data. The method is implemented via a primal-dual optimization framework and is evaluated across 7 LLM scales (fast and slow reasoning) and 10 robotic control tasks to demonstrate its generality.
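To make the contrast concrete, here is a minimal NumPy sketch of how the two approaches treat a high-advantage sample whose ratio has drifted far from 1. The summary does not give the R²VPO objective, so the penalized form `mean(r * A) - lam * Var(r)`, the fixed weight `lam`, and both function names are illustrative assumptions, not the authors' formulation. The clipped surrogate is the standard PPO-style one.

```python
import numpy as np

def clipped_surrogate_grad(ratio, adv, eps=0.2):
    """Gradient of mean(min(r*A, clip(r, 1-eps, 1+eps)*A)) w.r.t. each ratio."""
    ratio = np.asarray(ratio, dtype=float)
    adv = np.asarray(adv, dtype=float)
    # The clipped branch is selected (and its gradient is zero) once the ratio
    # has moved past the bound in the direction the advantage favors.
    clipped = ((adv > 0) & (ratio > 1 + eps)) | ((adv < 0) & (ratio < 1 - eps))
    return np.where(clipped, 0.0, adv) / ratio.size

def variance_penalized_grad(ratio, adv, lam=0.5):
    """Gradient of mean(r*A) - lam * Var(r) w.r.t. each ratio.

    Assumed illustrative objective (population variance), not the paper's.
    """
    ratio = np.asarray(ratio, dtype=float)
    adv = np.asarray(adv, dtype=float)
    # d Var(r) / d r_i = 2 * (r_i - mean(r)) / N for the population variance.
    return (adv - 2.0 * lam * (ratio - ratio.mean())) / ratio.size

ratios = [1.0, 1.5, 0.9]        # the second sample is far outside [0.8, 1.2]
advantages = [1.0, 2.0, -1.0]   # ...and also has the largest advantage

g_clip = clipped_surrogate_grad(ratios, advantages)
g_var = variance_penalized_grad(ratios, advantages)
print("clipped gradient:          ", g_clip)
print("variance-penalized gradient:", g_var)
```

Under these assumptions the clipped surrogate contributes no gradient for the second sample, while the variance penalty only reduces it in proportion to how far that ratio sits from the batch mean.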
Key facts
- Standard on-policy RL relies on heuristic clipping to enforce trust regions.
- Clipping indiscriminately truncates high-return yet high-divergence updates.
- R²VPO constrains policy ratio variance as a local approximation to trust-region constraints.
- The approach acts as a distributional soft brake.
- It preserves critical gradient signals from novel discoveries.
- It enables reuse of stale, off-policy data.
- R²VPO uses a primal-dual optimization framework.
- Evaluated across 7 LLM scales and 10 robotic control tasks.
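The primal-dual framework mentioned above can be sketched in its generic textbook form: the policy maximizes a Lagrangian while a multiplier rises whenever the constraint is violated. The snippet below shows only the dual step for a constraint of the form `Var(r) <= delta`. The threshold `delta`, the step size `lr`, and the function name `dual_update` are hypothetical, and the summary does not state how R²VPO actually updates its multiplier.

```python
import numpy as np

def dual_update(lam, ratio, delta=0.05, lr=0.1):
    """One projected dual-ascent step on the multiplier for Var(r) <= delta.

    Generic Lagrangian update, assumed for illustration only.
    """
    violation = np.var(np.asarray(ratio, dtype=float)) - delta
    # Raise lam when the variance exceeds the budget, lower it otherwise,
    # and project back onto lam >= 0.
    return max(0.0, float(lam + lr * violation))

lam_up = dual_update(1.0, [1.0, 1.5, 0.9])      # variance above the budget
lam_down = dual_update(0.001, [1.0, 1.0, 1.0])  # zero variance, lam projected to 0
```

In this reading the multiplier is what makes the brake "soft": the penalty weight adapts to how much the batch of ratios spreads out, rather than cutting off individual samples at a fixed bound.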