Q-Value Iteration Reaches Optimal Policy in Finite Time via Geometry
A recent study of discounted Q-value iteration (Q-VI) shows that the algorithm identifies an optimal greedy policy after finitely many iterations, rather than merely approaching one asymptotically. The paper introduces the practically optimal solution set (POSS): the set of Q-functions whose tie-broken greedy policies are optimal. The central result is that, after finitely many iterations, Q-VI enters an invariant tube around the affine space X1 = Q* + span(1), a tube contained in the POSS. For any epsilon > 0, the distance to X1 then decays exponentially as (rho_bar + epsilon)^k, where rho_bar is the joint spectral radius of the projected switching family acting in the directions transverse to X1. When rho_bar < gamma, this transverse convergence is strictly faster than the standard gamma-contraction bound. The analysis reinterprets Q-VI as a switching system, giving a geometric view of policy identification. The paper is available on arXiv under ID 2604.17457.
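The finite-time claim is easy to observe numerically: run Q-value iteration and record the tie-broken greedy policy at each step; the policy typically freezes long before the Q-values converge in norm. Below is a minimal sketch on a hypothetical two-state, two-action MDP (the transition and reward numbers are invented for illustration and are not from the paper).

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (numbers invented for illustration).
# P[a, s, s'] = transition probability; R[s, a] = immediate reward.
gamma = 0.9
P = np.array([
    [[0.8, 0.2], [0.3, 0.7]],   # transitions under action 0
    [[0.1, 0.9], [0.6, 0.4]],   # transitions under action 1
])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

def q_vi(Q, n_iters):
    """Q-value iteration: Q(s,a) <- R(s,a) + gamma * sum_s' P(s'|s,a) max_a' Q(s',a')."""
    policies = []
    for _ in range(n_iters):
        V = Q.max(axis=1)                          # greedy state values
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        policies.append(tuple(Q.argmax(axis=1)))   # tie-broken greedy policy
    return Q, policies

Q, policies = q_vi(np.zeros((2, 2)), 50)

# First iteration after which the greedy policy never changes again:
first_stable = next(k for k in range(len(policies))
                    if all(p == policies[k] for p in policies[k:]))
print("greedy policy fixed from iteration", first_stable, "->", policies[-1])
```

On this toy instance the greedy policy stabilizes within the first few iterations, while the Q-values themselves keep shrinking toward Q* at the gamma rate — exactly the gap between policy identification and value convergence that the paper formalizes.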
Key facts
- Q-value iteration is analyzed as a switching system.
- The practically optimal solution set (POSS) is defined as Q-functions with optimal tie-broken greedy policies.
- Q-VI produces an optimal tie-broken greedy policy after finitely many iterations.
- Convergence occurs by entering an invariant tube around X1 = Q* + span(1).
- For any epsilon > 0, the distance to X1 decays exponentially as (rho_bar + epsilon)^k.
- rho_bar is the joint spectral radius of the projected switching family.
- When rho_bar < gamma, transverse convergence is faster than gamma-contraction.
- The paper is on arXiv with ID 2604.17457.