Offline-to-Online RL: Adaptive Policy Selection Under Interaction Budgets
A new paper on arXiv (2605.05123) addresses challenges in offline-to-online reinforcement learning (O2O-RL), where policies are first trained on static datasets and then fine-tuned with a limited budget of online interactions. The authors identify two key issues: evaluating candidate policies with off-policy evaluation (OPE) can be unreliable, making it a risky basis for deployment, while online evaluation (OE) consumes interaction budget that could otherwise be spent on fine-tuning. Additionally, it is often impossible to know a priori whether a pretrained policy will actually improve with fine-tuning, especially in non-stationary environments. The paper proposes an adaptive method for selecting and fine-tuning policies under an interaction budget, aiming to balance exploration and exploitation without committing to a single policy upfront.
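The abstract does not spell out the method's mechanics, so the following is only a minimal illustrative sketch of the underlying budget-constrained selection problem: candidate pretrained policies are treated as bandit arms, and a UCB-style rule decides which policy receives each unit of the online-interaction budget. The environment interface (`env_step`), the stand-in policies, and the UCB rule are all assumptions for illustration, not the paper's algorithm.

```python
import math
import random


def run_episode(env_step, policy, horizon=100):
    """Roll out one episode; env_step(action) -> (reward, done) is a placeholder interface."""
    total = 0.0
    for _ in range(horizon):
        reward, done = env_step(policy())
        total += reward
        if done:
            break
    return total


def select_policy_under_budget(policies, env_step, budget):
    """Spend a fixed online-interaction budget across candidate pretrained
    policies with a UCB-style rule, so no single policy is committed to upfront."""
    n = len(policies)
    counts, sums = [0] * n, [0.0] * n
    for t in range(budget):
        # Optimistic score per policy; untried policies are rolled out first.
        scores = [
            math.inf if counts[i] == 0
            else sums[i] / counts[i] + math.sqrt(2 * math.log(t + 1) / counts[i])
            for i in range(n)
        ]
        i = scores.index(max(scores))
        ret = run_episode(env_step, policies[i])  # one episode = one unit of budget
        counts[i] += 1
        sums[i] += ret
    # Return the index of the policy with the best empirical online return.
    return max(range(n), key=lambda i: sums[i] / counts[i] if counts[i] else -math.inf)


# Toy usage: two stand-in "policies" in a one-step, bandit-like environment.
random.seed(0)

def env_step(action):
    return random.gauss(action, 1.0), True  # reward ~ N(action, 1), episode ends immediately

policies = [lambda: 0.2, lambda: 0.8]
print("selected policy index:", select_policy_under_budget(policies, env_step, budget=50))
```

In this toy setup the budget spent rolling out weaker candidates is budget not available for fine-tuning the eventual winner, which is the trade-off the paper's adaptive approach is meant to manage.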
Key facts
- Paper arXiv:2605.05123 addresses offline-to-online reinforcement learning.
- Policies are first trained offline using previously collected datasets.
- Fine-tuning occurs via limited online interactions.
- Candidate policies are evaluated using off-policy evaluation (OPE) or online evaluation (OE).
- OPE can be unreliable, making deployment risky (see the importance-sampling sketch after this list).
- OE may require substantial online interaction that could be used for fine-tuning.
- It is often not possible to determine in advance whether a pretrained policy will improve with fine-tuning.
- Non-stationary environments complicate policy improvement prediction.
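To make the OPE concern concrete, here is a standard ordinary (trajectory-wise) importance-sampling estimator, which is one common OPE technique but not necessarily the one the paper uses. The product of per-step likelihood ratios inflates the variance as the target and behavior policies diverge, which is a textbook source of OPE unreliability; the `pi_target`/`pi_behavior` interface is a simplifying assumption.

```python
import numpy as np


def importance_sampling_ope(trajectories, pi_target, pi_behavior, gamma=0.99):
    """Ordinary importance-sampling estimate of a target policy's expected
    return from trajectories collected by a behavior policy.

    Each trajectory is a list of (state, action, reward) tuples;
    pi_target(a, s) and pi_behavior(a, s) return action probabilities.
    """
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            weight *= pi_target(a, s) / pi_behavior(a, s)  # product of likelihood ratios
            ret += (gamma ** t) * r
        estimates.append(weight * ret)
    # The spread of the weighted returns grows quickly as the two policies
    # diverge, which is why OPE-based deployment decisions can be risky.
    return float(np.mean(estimates)), float(np.std(estimates))


# Toy usage: one-step trajectories from a uniform behavior policy over {0, 1}.
trajs = [[("s0", a, float(a))] for a in (0, 1, 0, 1)]
pi_b = lambda a, s: 0.5                      # behavior: uniform
pi_t = lambda a, s: 0.9 if a == 1 else 0.1   # target: prefers action 1
print(importance_sampling_ope(trajs, pi_t, pi_b))
```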
Entities
Institutions
- arXiv