Q-Value Iteration Reaches Optimal Policy in Finite Time via Geometry
A recent study of discounted Q-value iteration (Q-VI) shows that the algorithm identifies an optimal greedy policy after finitely many iterations, rather than merely approaching one asymptotically. The paper introduces the practically optimal solution set (POSS): the set of Q-functions whose tie-broken greedy policies are optimal. The central result is that, after finitely many iterations, Q-VI enters an invariant tube around the affine space X1 = Q* + span(1), a tube contained in the POSS. For any epsilon > 0, the distance to X1 then decays exponentially as (rho_bar + epsilon)^k, where rho_bar is the joint spectral radius of the projected switching family acting in the directions transverse to X1. When rho_bar < gamma, this transverse convergence is strictly faster than the standard gamma-contraction bound. The analysis reinterprets Q-VI as a switching system, giving a geometric view of policy identification. The paper is available on arXiv under ID 2604.17457.
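The finite-time claim is easy to observe numerically: run Q-value iteration and record the tie-broken greedy policy at each step; the policy typically freezes long before the Q-values converge in norm. Below is a minimal sketch on a hypothetical two-state, two-action MDP (the transition and reward numbers are invented for illustration and are not from the paper).

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (numbers invented for illustration).
# P[a, s, s'] = transition probability; R[s, a] = immediate reward.
gamma = 0.9
P = np.array([
    [[0.8, 0.2], [0.3, 0.7]],   # transitions under action 0
    [[0.1, 0.9], [0.6, 0.4]],   # transitions under action 1
])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

def q_vi(Q, n_iters):
    """Q-value iteration: Q(s,a) <- R(s,a) + gamma * sum_s' P(s'|s,a) max_a' Q(s',a')."""
    policies = []
    for _ in range(n_iters):
        V = Q.max(axis=1)                          # greedy state values
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        policies.append(tuple(Q.argmax(axis=1)))   # tie-broken greedy policy
    return Q, policies

Q, policies = q_vi(np.zeros((2, 2)), 50)

# First iteration after which the greedy policy never changes again:
first_stable = next(k for k in range(len(policies))
                    if all(p == policies[k] for p in policies[k:]))
print("greedy policy fixed from iteration", first_stable, "->", policies[-1])
```

On this toy instance the greedy policy stabilizes within the first few iterations, while the Q-values themselves keep shrinking toward Q* at the gamma rate — exactly the gap between policy identification and value convergence that the paper formalizes.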
Key facts
- Q-value iteration is analyzed as a switching system.
- The practically optimal solution set (POSS) is defined as Q-functions with optimal tie-broken greedy policies.
- Q-VI produces an optimal tie-broken greedy policy after finitely many iterations.
- Convergence occurs by entering an invariant tube around X1 = Q* + span(1).
- For any epsilon > 0, the distance to X1 decays exponentially as (rho_bar + epsilon)^k.
- rho_bar is the joint spectral radius of the projected switching family.
- When rho_bar < gamma, transverse convergence is faster than gamma-contraction.
- The paper is on arXiv with ID 2604.17457.