ARTFEED — Contemporary Art Intelligence

SOPE Algorithm Stabilizes Off-Policy Evaluation for Online RL

other · 2026-05-09

SOPE (Stabilizing Off-Policy Evaluation) is a newly proposed algorithm that addresses the problem of incorporating historical data into online reinforcement learning. It uses an actor-aligned Off-Policy Policy Evaluation (OPE) signal as an automated early-stopping mechanism, dynamically controlling how long each offline training phase runs. SOPE evaluates the critic on a held-out validation split under the current policy's action distribution and halts gradient updates once the benefit of out-of-distribution data saturates, eliminating the need for manual schedule tuning. The algorithm was evaluated on 25 continuous control tasks from the Minari benchmark suite.
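
The actor-aligned evaluation step can be made concrete with a short sketch. To be clear, this is an illustration rather than the paper's implementation: the function name `ope_validation_score`, the critic/actor callables, and the use of a mean Q-value as the validation statistic are all assumptions; SOPE's actual OPE estimator may differ.

```python
import numpy as np

def ope_validation_score(critic, actor, val_states, n_action_samples=8, rng=None):
    """Hypothetical actor-aligned OPE signal.

    Scores the critic on a held-out validation split, but under the CURRENT
    policy's action distribution rather than the logged behavior actions.

    critic(states, actions) -> Q-value estimates, shape (batch,)
    actor(states, rng)      -> actions sampled from pi(.|s), shape (batch, act_dim)
    """
    if rng is None:
        rng = np.random.default_rng(0)
    scores = []
    for _ in range(n_action_samples):
        actions = actor(val_states, rng)            # a ~ pi(.|s), actor-aligned
        scores.append(critic(val_states, actions))  # critic judged on those actions
    # Collapse over action samples and validation states into one scalar signal.
    return float(np.mean(scores))
```

Because the actions come from the current policy rather than from the dataset, the score tracks how well the critic generalizes to the state-action distribution the actor will actually induce, which is what makes it usable as a stopping signal.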

Key facts

  • SOPE uses an actor-aligned OPE signal as an automated early-stopping mechanism.
  • It dynamically controls the length of offline training phases.
  • The critic is evaluated on a held-out validation split under the current policy's action distribution.
  • Gradient updates halt when out-of-distribution benefits saturate (a sketch of such a stopping rule follows this list).
  • No manual schedule tuning is required.
  • Evaluated on 25 continuous control tasks from the Minari benchmark.
  • The work is published on arXiv with ID 2605.05863.
  • The approach removes the usual trade-off between extra computational cost and hand-tuned multi-stage pipelines.
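
Continuing the sketch above, the stopping rule from the bullet list could be implemented as a simple plateau detector over that validation signal. The `PlateauStopper` class, the `offline_phase` driver, and the `min_delta` and `patience` hyperparameters are all illustrative assumptions, not details from the paper.

```python
class PlateauStopper:
    """Halt offline gradient updates once the actor-aligned OPE signal stops
    improving, i.e., once out-of-distribution benefits saturate (illustrative)."""

    def __init__(self, min_delta=1e-3, patience=5):
        self.min_delta = min_delta  # smallest improvement that still counts
        self.patience = patience    # evaluations without improvement before stopping
        self.best = -float("inf")
        self.stale = 0

    def should_stop(self, ope_score):
        if ope_score > self.best + self.min_delta:
            self.best, self.stale = ope_score, 0  # still improving: keep training
        else:
            self.stale += 1                       # no meaningful gain this round
        return self.stale >= self.patience

def offline_phase(train_step, ope_signal, max_steps=10_000, eval_every=100):
    """Run offline updates until the OPE signal plateaus, then hand control
    back to online interaction. train_step() does one gradient update;
    ope_signal() returns the actor-aligned validation score sketched earlier."""
    stopper = PlateauStopper()
    for step in range(1, max_steps + 1):
        train_step()
        if step % eval_every == 0 and stopper.should_stop(ope_signal()):
            return step  # benefits of the offline data have saturated
    return max_steps
```

In this framing, the length of each offline phase is chosen by the data itself rather than by a hand-tuned schedule, which is the trade-off the last bullet refers to.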

Entities

Institutions

  • arXiv
