Tree MDPs Learned by Treating Policies as Bandit Arms
A new paper on arXiv (2605.04979) introduces an approach to online learning in Tree Markov Decision Problems (T-MDPs) that treats each policy as an arm in a bandit algorithm. T-MDPs are finite-horizon MDPs in which every state is reachable from the start state by exactly one trajectory, which makes them a natural model for sequential games with perfect recall played against stationary opponents. Although the number of policies is exponential in the size of the tree, the authors show that standard bandit algorithms such as LUCB (for best-policy identification) and UCB (for regret minimization) can still be applied: they design confidence bounds that share data across policies, so the learner needs only polynomial memory and polynomial per-step computation. The paper provides instance-dependent upper bounds on both sample complexity and regret.
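To make the policy-as-arm idea concrete, here is a minimal sketch, assuming a toy T-MDP with two actions per state, two stochastic outcomes per action, and rewards clipped to [0, 1]. It realizes UCB over the (exponentially many) policies via optimistic value computation on the empirical tree with a per state-action bonus, a standard construction; the environment, reward means, horizon, and bonus schedule here are illustrative assumptions, not the paper's specific algorithm.

```python
import math
import random

H = 2                      # horizon (decisions per episode); assumed toy value
ACTIONS = (0, 1)
rng = random.Random(0)

# Toy T-MDP: a state is encoded as the unique trajectory that reaches it,
# a tuple of (action, outcome) pairs, so the T-MDP property holds by construction.
def env_step(state, action):
    outcome = int(rng.random() < 0.5)                     # stochastic transition
    child = state + ((action, outcome),)
    mean = 0.3 + 0.05 * (action + outcome + len(child))   # assumed reward means
    return child, min(1.0, max(0.0, mean + rng.gauss(0.0, 0.1)))

# Shared statistics: one (count, reward sum, child counts) record per (s, a).
counts, rsums, children = {}, {}, {}

def q_optimistic(state, action, depth, t):
    """Optimistic Q-value: empirical mean + UCB bonus + optimistic next value.
    Every policy that traverses (state, action) shares these statistics."""
    n = counts.get((state, action), 0)
    if n == 0:
        return float(H - depth)        # never tried: assume max remaining reward
    bonus = math.sqrt(2.0 * math.log(t + 1) / n)
    v_next = 0.0
    if depth + 1 < H:
        for child, c in children[(state, action)].items():
            v_next += (c / n) * max(
                q_optimistic(child, a, depth + 1, t) for a in ACTIONS)
    return rsums[(state, action)] / n + bonus + v_next

for t in range(500):                                      # episodes
    state, traj = (), []
    for depth in range(H):
        action = max(ACTIONS, key=lambda a: q_optimistic(state, a, depth, t))
        nxt, reward = env_step(state, action)
        traj.append((state, action, nxt, reward))
        state = nxt
    for s, a, nxt, r in traj:                             # update shared statistics
        counts[(s, a)] = counts.get((s, a), 0) + 1
        rsums[(s, a)] = rsums.get((s, a), 0.0) + r
        d = children.setdefault((s, a), {})
        d[nxt] = d.get(nxt, 0) + 1
```

The learner stores one record per state-action pair it has ever visited, so memory grows linearly with the tree even though the set of deterministic policies it implicitly compares is exponential in the number of states.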
Key facts
- Paper arXiv:2605.04979 published in 2025.
- Focuses on Tree Markov Decision Problems (T-MDPs).
- T-MDPs are finite-horizon MDPs in which each state is reached by a unique trajectory from the start state.
- Applicable to sequential games with perfect recall.
- Treats each policy as an arm in bandit algorithms.
- Uses LUCB (for best-policy identification) and UCB (for regret minimization).
- Confidence bounds designed to share data across policies.
- Achieves polynomial memory and per-step computation despite the exponential number of policies (see the counting sketch after this list).
- Provides instance-dependent upper bounds on sample complexity and regret.
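The gap between the naive arm count and what the learner actually stores is easy to see with a quick count. The uniform branching numbers below are assumptions chosen for illustration, not figures from the paper.

```python
# For an assumed uniform tree: n_actions per state, n_outcomes per action,
# horizon H. A deterministic policy fixes one action at every decision state,
# so the naive bandit has n_actions ** n_states arms, while the per-(s, a)
# statistics stay polynomial in the size of the tree.
def tree_counts(horizon, n_actions=2, n_outcomes=2):
    branching = n_actions * n_outcomes
    n_states = sum(branching ** d for d in range(horizon))  # decision states
    n_stats = n_states * n_actions                          # stored (s, a) records
    n_policies = n_actions ** n_states                      # policies-as-arms
    return n_states, n_stats, n_policies

for h in range(1, 5):
    states, stats, policies = tree_counts(h)
    print(f"H={h}: states={states:>3}  stored stats={stats:>3}  policies={policies}")
```

Already at horizon 4 this toy tree has 2^85 deterministic policies but only 170 stored state-action records, which is why confidence bounds that share data across policies are what make the bandit view tractable.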