ARTFEED — Contemporary Art Intelligence

Offline-to-Online RL: Adaptive Policy Selection Under Interaction Budgets

other · 2026-05-07

A new paper on arXiv (2605.05123) addresses challenges in offline-to-online reinforcement learning (O2O-RL), where policies are first trained on static datasets and then fine-tuned with limited online interactions. The authors identify two key issues with selecting among candidate policies: off-policy evaluation (OPE) can be unreliable, making it risky to deploy a policy chosen on that basis, and online evaluation (OE) consumes interaction budget that could otherwise be spent on fine-tuning. Moreover, it is often impossible to know a priori whether a pretrained policy will improve after deployment, especially in non-stationary environments. The paper proposes an adaptive method for selecting and fine-tuning policies under an interaction budget, aiming to balance exploration and exploitation without committing to a single policy up front.
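
The summary above does not spell out the paper's algorithm, so the following is only a rough sketch of the general setting, not the authors' method: a toy bandit-style loop in Python (all names such as CandidatePolicy, run_episode, and finetune_step are hypothetical placeholders) that spends a fixed online interaction budget adaptively across offline-pretrained candidates, using each episode both to refine a return estimate and to fine-tune the sampled policy, instead of committing to one policy up front.

```python
"""Illustrative sketch only: the paper's actual algorithm is not reproduced here.
Each offline-pretrained candidate policy is treated as a bandit arm; a UCB rule
allocates the online interaction budget across candidates, and every episode's
data is reused for fine-tuning the policy that generated it."""
import math
import random


class CandidatePolicy:
    """Stand-in for a policy pretrained offline; `quality` simulates its
    unknown true online return."""

    def __init__(self, name, quality):
        self.name = name
        self.quality = quality

    def run_episode(self):
        # One online episode: consumes one unit of the interaction budget
        # and yields a noisy return.
        return random.gauss(self.quality, 1.0)

    def finetune_step(self, episode_return):
        # Placeholder fine-tuning update: nudge the simulated quality
        # toward the observed return.
        self.quality += 0.01 * (episode_return - self.quality)


def adaptive_select_and_finetune(candidates, budget):
    """Allocate `budget` online episodes across candidates with a UCB score,
    so evaluation and fine-tuning share the same interactions."""
    counts = {p.name: 0 for p in candidates}
    means = {p.name: 0.0 for p in candidates}

    for t in range(1, budget + 1):
        # Try every candidate once before applying the UCB score.
        untried = [p for p in candidates if counts[p.name] == 0]
        if untried:
            policy = untried[0]
        else:
            policy = max(
                candidates,
                key=lambda p: means[p.name]
                + math.sqrt(2.0 * math.log(t) / counts[p.name]),
            )

        ret = policy.run_episode()   # consumes one episode of the budget
        policy.finetune_step(ret)    # reuse the same data for fine-tuning

        counts[policy.name] += 1
        means[policy.name] += (ret - means[policy.name]) / counts[policy.name]

    best = max(candidates, key=lambda p: means[p.name])
    return best, means, counts


if __name__ == "__main__":
    random.seed(0)
    pool = [CandidatePolicy(f"pi_{i}", q) for i, q in enumerate([0.5, 1.0, 1.5])]
    best, means, counts = adaptive_select_and_finetune(pool, budget=100)
    print("selected:", best.name, "| estimates:", means, "| episodes used:", counts)
```

The UCB score here is just a stand-in for whatever selection rule the paper actually uses; the point of the sketch is only that evaluation and fine-tuning draw from the same limited budget.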

Key facts

  • Paper arXiv:2605.05123 addresses offline-to-online reinforcement learning.
  • Policies are first trained offline using previously collected datasets.
  • Fine-tuning occurs via limited online interactions.
  • Candidate policies are evaluated using off-policy evaluation (OPE) or online evaluation (OE).
  • OPE can be unreliable, making deployment risky.
  • OE may require substantial online interaction that could otherwise be used for fine-tuning (made concrete in the example after this list).
  • It is often not possible to determine in advance whether a pretrained policy will improve with fine-tuning.
  • Non-stationary environments complicate policy improvement prediction.
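
To make the budget tension concrete (the numbers are illustrative, not taken from the paper): with an online budget of 100 episodes, evaluating 5 candidate policies for 4 episodes each already consumes 20 episodes, leaving 80 for fine-tuning the selected policy; doubling the evaluation to 10 episodes per candidate cuts the fine-tuning budget to 50. An adaptive scheme of the kind the paper describes aims to avoid fixing this split in advance.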

Entities

Institutions

  • arXiv

Sources