Long-Horizon Q-Learning Stabilizes Off-Policy RL via n-Step Inequalities
Researchers have introduced long-horizon Q-learning (LQL), an approach that stabilizes off-policy value-based reinforcement learning by tackling the compounding errors that arise from bootstrapping. LQL builds on an optimality-tightening observation: the return obtained by following any sequence of actions is a lower bound on the expected return of the optimal policy. In other words, acting optimally starting now can be no worse than following previously observed actions for a few steps and only then switching to optimal behavior, so the optimal Q-value at the current state and action must be at least the observed n-step return bootstrapped with the optimal value n steps later. The main contribution is turning this n-step inequality into a practical stabilization mechanism for Q-learning. The method is described in a paper available on arXiv (2605.05812).
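The paper is the authoritative reference; as a rough illustration of the inequality described above, the sketch below adds an n-step lower-bound penalty to a standard 1-step Q-learning loss. Everything here is an assumption for illustration rather than the authors' implementation: the PyTorch framing, the function and batch-field names (q_net, target_net, states, actions, rewards, dones), and the squared-hinge form of the penalty.

```python
# A minimal sketch of the n-step lower-bound idea, NOT the LQL authors' code.
# Assumes a dict `batch` of trajectories of length >= n+1 and two Q-networks.
import torch
import torch.nn.functional as F

def lql_style_loss(q_net, target_net, batch, gamma=0.99, n=5, penalty_weight=1.0):
    """1-step Q-learning loss plus a penalty enforcing the n-step inequality:
    Q(s_t, a_t) should not fall below the observed n-step return
    bootstrapped with the target network at step t+n.

    Batch fields (shapes):
      states:  (B, n+1, obs_dim)   actions: (B, n+1) long
      rewards: (B, n)              dones:   (B, n)  -- 1.0 where the episode ended
    """
    states, actions, rewards, dones = (
        batch["states"], batch["actions"], batch["rewards"], batch["dones"]
    )
    # Q-value of the action actually taken at the first step of the segment.
    q_sa = q_net(states[:, 0]).gather(1, actions[:, 0:1]).squeeze(1)

    with torch.no_grad():
        # Ordinary 1-step bootstrapped target.
        next_q = target_net(states[:, 1]).max(dim=1).values
        one_step_target = rewards[:, 0] + gamma * (1.0 - dones[:, 0]) * next_q

        # n-step return along the observed (possibly suboptimal) actions,
        # bootstrapped at step n: a lower bound on the optimal value.
        lower_bound = torch.zeros_like(q_sa)
        discount = torch.ones_like(q_sa)
        alive = torch.ones_like(q_sa)
        for k in range(n):
            lower_bound = lower_bound + alive * discount * rewards[:, k]
            alive = alive * (1.0 - dones[:, k])
            discount = discount * gamma
        boot_q = target_net(states[:, n]).max(dim=1).values
        lower_bound = lower_bound + alive * discount * boot_q

    td_loss = F.smooth_l1_loss(q_sa, one_step_target)
    # Penalize only violations of the inequality Q(s_t, a_t) >= lower_bound.
    violation = F.relu(lower_bound - q_sa)
    return td_loss + penalty_weight * (violation ** 2).mean()
```

The penalty only fires when the current Q-estimate drops below the observed n-step return, which is the direction of the inequality: an overly pessimistic bootstrap target cannot drag the value estimate under what the data has already demonstrated is achievable.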
Key facts
- LQL addresses compounding errors in Q-learning from bootstrapping.
- It builds on a prior optimality-tightening observation.
- The method uses n-step inequalities to stabilize learning.
- The paper is available on arXiv with ID 2605.05812.
- LQL is designed for off-policy, value-based reinforcement learning.
Entities
Institutions
- arXiv