ARTFEED — Contemporary Art Intelligence

R²VPO: A New Reinforcement Learning Method Without Clipping

ai-technology · 2026-05-27

Ratio-Variance Regularized Policy Optimization (R²VPO) is a new reinforcement learning method that replaces the heuristic clipping used in on-policy algorithms. Standard on-policy RL clips the policy ratio to enforce a trust region, but clipping indiscriminately truncates updates that are high-return yet high-divergence. R²VPO instead constrains the variance of the policy ratio, a principled local approximation to the trust-region constraint. The constraint acts as a distributional soft brake: it preserves gradient signals from novel discoveries and enables reuse of stale, off-policy data. The method is implemented through a primal-dual optimization framework. It was evaluated across 7 LLM scales, covering both fast and slow reasoning, and on 10 robotic control tasks, results that are reported to demonstrate its generality.
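The summary does not give the method's equations, so the following is only one plausible reading of "constraining policy ratio variance" under a primal-dual scheme. The symbols (the ratio r, advantage estimate Â, variance budget δ, multiplier λ, dual step size η) are assumptions introduced here for illustration. Only the clipped surrogate is the standard PPO form.

```latex
% Policy ratio between the current and the data-collecting policy (assumed notation)
r_\theta(s,a) = \frac{\pi_\theta(a \mid s)}{\pi_{\mathrm{old}}(a \mid s)}

% Standard PPO clipped surrogate: the heuristic trust region
L^{\mathrm{clip}}(\theta) =
  \mathbb{E}_{\pi_{\mathrm{old}}}\!\left[\min\!\left(r_\theta \hat{A},\;
  \operatorname{clip}(r_\theta,\,1-\epsilon,\,1+\epsilon)\,\hat{A}\right)\right]

% Hypothetical ratio-variance constrained problem
\max_\theta \; \mathbb{E}_{\pi_{\mathrm{old}}}\!\left[r_\theta \hat{A}\right]
  \quad \text{s.t.} \quad \operatorname{Var}_{\pi_{\mathrm{old}}}\!\left[r_\theta\right] \le \delta

% Lagrangian (primal step maximizes over \theta) and projected dual ascent on \lambda
\mathcal{L}(\theta,\lambda) =
  \mathbb{E}_{\pi_{\mathrm{old}}}\!\left[r_\theta \hat{A}\right]
  - \lambda\left(\operatorname{Var}_{\pi_{\mathrm{old}}}\!\left[r_\theta\right] - \delta\right),
  \qquad \lambda \ge 0

\lambda \leftarrow \max\!\left(0,\; \lambda + \eta\left(\operatorname{Var}_{\pi_{\mathrm{old}}}\!\left[r_\theta\right] - \delta\right)\right)

% Why the variance is a local trust-region proxy: under on-policy sampling E[r] = 1, so
\operatorname{Var}_{\pi_{\mathrm{old}}}\!\left[r_\theta\right]
  = \mathbb{E}_{\pi_{\mathrm{old}}}\!\left[(r_\theta - 1)^2\right],
\qquad
\mathrm{KL}\!\left(\pi_{\mathrm{old}} \,\|\, \pi_\theta\right)
  \approx \tfrac{1}{2}\,\mathbb{E}_{\pi_{\mathrm{old}}}\!\left[(r_\theta - 1)^2\right]
```

The last line follows from a second-order expansion of the logarithm around r = 1, which is why a bound on ratio variance behaves like a KL trust region when the two policies are close.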

Key facts

  • Standard on-policy RL relies on heuristic clipping to enforce trust regions.
  • Clipping indiscriminately truncates high-return yet high-divergence updates.
  • R²VPO constrains policy ratio variance as a local approximation to trust-region constraints.
  • The approach acts as a distributional soft brake.
  • It preserves critical gradient signals from novel discoveries.
  • It enables reuse of stale, off-policy data.
  • R²VPO uses a primal-dual optimization framework.
  • Evaluated across 7 LLM scales and 10 robotic control tasks.
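
The contrast between hard clipping and a variance-based soft brake can be illustrated numerically. The sketch below is not the published R²VPO implementation: the penalized objective, the function names, and the parameters lam, delta, and lr are all assumptions made for illustration. It compares the per-sample gradient with respect to the policy ratio under a PPO-style clip and under a variance penalty, and shows one dual-ascent step on the multiplier.

```python
from statistics import fmean, pvariance


def clipped_grads(ratios, advantages, eps=0.2):
    """Gradient of mean(min(r*A, clip(r, 1-eps, 1+eps)*A)) w.r.t. each ratio.

    The gradient is zero whenever the clipped branch is the active minimum.
    """
    n = len(ratios)
    grads = []
    for r, a in zip(ratios, advantages):
        clipped = (a > 0 and r > 1 + eps) or (a < 0 and r < 1 - eps)
        grads.append(0.0 if clipped else a / n)
    return grads


def variance_penalized_grads(ratios, advantages, lam):
    """Gradient of mean(r*A) - lam * Var(r) w.r.t. each ratio.

    Assumed objective (hypothetical): population variance of the batch ratios
    is penalized with multiplier lam instead of clipping individual samples.
    """
    n = len(ratios)
    mean_r = fmean(ratios)
    return [(a - 2.0 * lam * (r - mean_r)) / n
            for r, a in zip(ratios, advantages)]


def dual_step(lam, ratios, delta, lr=0.5):
    """Projected dual ascent: raise lam if Var(r) exceeds the budget delta."""
    return max(0.0, lam + lr * (pvariance(ratios) - delta))


# A batch where the last sample is a high-return, high-divergence "discovery".
ratios = [0.9, 1.0, 1.1, 1.8]
advantages = [-0.5, 0.1, 0.2, 3.0]

if __name__ == "__main__":
    print("clip grads:    ", clipped_grads(ratios, advantages))
    print("variance grads:", variance_penalized_grads(ratios, advantages, lam=1.0))
    print("updated lam:   ", dual_step(1.0, ratios, delta=0.05))
```

In this toy batch the clipped objective contributes no gradient for the last sample, while the variance-penalized objective still pushes its ratio upward, only more gently. The multiplier grows when the batch variance exceeds the budget, which tightens the brake on later updates.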
