Offline-to-Online RL: Adaptive Policy Selection Under Interaction Budgets
A new paper on arXiv (2605.05123) addresses challenges in offline-to-online reinforcement learning (O2O-RL), where policies are first trained on static datasets and then fine-tuned with a limited budget of online interactions. The authors identify two key issues: evaluating candidate policies with off-policy evaluation (OPE) can be unreliable, making it a risky basis for deployment, while online evaluation (OE) consumes interaction budget that could otherwise be spent on fine-tuning. Additionally, it is often impossible to know a priori whether a pretrained policy will actually improve with fine-tuning, especially in non-stationary environments. The paper proposes an adaptive method for selecting and fine-tuning policies under an interaction budget, aiming to balance exploration and exploitation without committing to a single policy upfront.
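The abstract does not spell out the method's mechanics, so the following is only a minimal illustrative sketch of the underlying budget-constrained selection problem: candidate pretrained policies are treated as bandit arms, and a UCB-style rule decides which policy receives each unit of the online-interaction budget. The environment interface (`env_step`), the stand-in policies, and the UCB rule are all assumptions for illustration, not the paper's algorithm.

```python
import math
import random


def run_episode(env_step, policy, horizon=100):
    """Roll out one episode; env_step(action) -> (reward, done) is a placeholder interface."""
    total = 0.0
    for _ in range(horizon):
        reward, done = env_step(policy())
        total += reward
        if done:
            break
    return total


def select_policy_under_budget(policies, env_step, budget):
    """Spend a fixed online-interaction budget across candidate pretrained
    policies with a UCB-style rule, so no single policy is committed to upfront."""
    n = len(policies)
    counts, sums = [0] * n, [0.0] * n
    for t in range(budget):
        # Optimistic score per policy; untried policies are rolled out first.
        scores = [
            math.inf if counts[i] == 0
            else sums[i] / counts[i] + math.sqrt(2 * math.log(t + 1) / counts[i])
            for i in range(n)
        ]
        i = scores.index(max(scores))
        ret = run_episode(env_step, policies[i])  # one episode = one unit of budget
        counts[i] += 1
        sums[i] += ret
    # Return the index of the policy with the best empirical online return.
    return max(range(n), key=lambda i: sums[i] / counts[i] if counts[i] else -math.inf)


# Toy usage: two stand-in "policies" in a one-step, bandit-like environment.
random.seed(0)

def env_step(action):
    return random.gauss(action, 1.0), True  # reward ~ N(action, 1), episode ends immediately

policies = [lambda: 0.2, lambda: 0.8]
print("selected policy index:", select_policy_under_budget(policies, env_step, budget=50))
```

In this toy setup the budget spent rolling out weaker candidates is budget not available for fine-tuning the eventual winner, which is the trade-off the paper's adaptive approach is meant to manage.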
Key facts
- Paper arXiv:2605.05123 addresses offline-to-online reinforcement learning.
- Policies are first trained offline using previously collected datasets.
- Fine-tuning occurs via limited online interactions.
- Candidate policies are evaluated using off-policy evaluation (OPE) or online evaluation (OE).
- OPE can be unreliable, making deployment risky (see the importance-sampling sketch after this list).
- OE may require substantial online interaction that could be used for fine-tuning.
- It is often not possible to determine in advance whether a pretrained policy will improve with fine-tuning.
- Non-stationary environments complicate policy improvement prediction.
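To make the OPE concern concrete, here is a standard ordinary (trajectory-wise) importance-sampling estimator, which is one common OPE technique but not necessarily the one the paper uses. The product of per-step likelihood ratios inflates the variance as the target and behavior policies diverge, which is a textbook source of OPE unreliability; the `pi_target`/`pi_behavior` interface is a simplifying assumption.

```python
import numpy as np


def importance_sampling_ope(trajectories, pi_target, pi_behavior, gamma=0.99):
    """Ordinary importance-sampling estimate of a target policy's expected
    return from trajectories collected by a behavior policy.

    Each trajectory is a list of (state, action, reward) tuples;
    pi_target(a, s) and pi_behavior(a, s) return action probabilities.
    """
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            weight *= pi_target(a, s) / pi_behavior(a, s)  # product of likelihood ratios
            ret += (gamma ** t) * r
        estimates.append(weight * ret)
    # The spread of the weighted returns grows quickly as the two policies
    # diverge, which is why OPE-based deployment decisions can be risky.
    return float(np.mean(estimates)), float(np.std(estimates))


# Toy usage: one-step trajectories from a uniform behavior policy over {0, 1}.
trajs = [[("s0", a, float(a))] for a in (0, 1, 0, 1)]
pi_b = lambda a, s: 0.5                      # behavior: uniform
pi_t = lambda a, s: 0.9 if a == 1 else 0.1   # target: prefers action 1
print(importance_sampling_ope(trajs, pi_t, pi_b))
```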
Entities
Institutions
- arXiv