Reinforcement Learning with Markov Risk Measures and Multipattern Approximation
The paper introduces a new class of Markov coherent risk measures, called mini-batch measures, for risk-averse finite-horizon Markov decision problems. It also defines multipattern risk-averse problems, a class that generalizes linear systems. Both constructions are used in a feature-based Q-learning method with multipattern Q-factor approximation, which achieves a high-probability regret bound of O(H^2 N^H sqrt(K)), where H is the horizon, N is the mini-batch size, and K is the number of episodes. The paper further proposes an economical version of the Q-learning method that streamlines the policy evaluation phase. The theoretical results are illustrated on a stochastic assignment problem and a short-horizon multi-armed bandit problem.
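To make the idea of evaluating a coherent risk measure on a batch of N samples concrete, here is a minimal Python sketch. It does not reproduce the paper's definition of a mini-batch measure, which this summary does not state. As a stand-in it uses the first-order mean-upper-semideviation, a standard coherent risk measure when its weight lies in [0, 1]. The function name `mean_upper_semideviation` and the parameter `kappa` are assumptions made for illustration.

```python
def mean_upper_semideviation(costs, kappa=0.5):
    """Empirical mean-upper-semideviation risk of a batch of sampled costs.

    Illustrative stand-in only, not the paper's mini-batch measure:
        rho(Z) = E[Z] + kappa * E[max(Z - E[Z], 0)],  with 0 <= kappa <= 1.
    """
    if not 0.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie in [0, 1] for coherence")
    if len(costs) == 0:
        raise ValueError("the batch must contain at least one sample")

    n = len(costs)
    mean = sum(costs) / n
    # Penalize only the samples that exceed the mean (costs, so higher is worse).
    upper_semideviation = sum(max(c - mean, 0.0) for c in costs) / n
    return mean + kappa * upper_semideviation


# Batch of N = 4 sampled costs: mean 3.0, upper semideviation 0.75.
print(mean_upper_semideviation([1.0, 2.0, 3.0, 6.0], kappa=0.5))  # 3.375
```

With `kappa = 0` the function reduces to the plain sample mean, which is the risk-neutral case. Larger values of `kappa` weight above-average costs more heavily.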
Key facts
- Introduces mini-batch Markov coherent risk measures.
- Defines multipattern risk-averse problems generalizing linear systems.
- Proposes feature-based Q-learning with multipattern Q-factor approximation.
- Proves regret bound O(H^2 N^H sqrt(K)).
- Proposes an economical version of the Q-learning method that streamlines policy evaluation.
- Illustrated on stochastic assignment problem.
- Illustrated on short-horizon multi-armed bandit problem.
- H is horizon, N is mini-batch size, K is number of episodes.
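The scaling in the regret bound can be explored numerically with a short Python sketch. It evaluates only the leading term H^2 N^H sqrt(K). The constants and any logarithmic factors hidden by the O-notation are not given in this summary and are omitted. The function name `regret_scale` is an assumption made for illustration.

```python
import math


def regret_scale(horizon, batch_size, episodes):
    """Leading term H^2 * N^H * sqrt(K) of the regret bound.

    Constants and logarithmic factors hidden by the O-notation are omitted,
    so the value is meaningful only for comparing how the bound scales.
    """
    if horizon < 1 or batch_size < 1 or episodes < 1:
        raise ValueError("H, N and K must all be at least 1")
    return horizon ** 2 * batch_size ** horizon * math.sqrt(episodes)


# Growth in K is sublinear: the per-episode average shrinks like 1/sqrt(K).
for k in (4, 16, 64):
    print(k, regret_scale(2, 3, k) / k)

# Growth in H is exponential through the N^H factor.
for h in (1, 2, 3):
    print(h, regret_scale(h, 3, 4))
```

The sqrt(K) factor means the average regret per episode tends to zero as K grows. The N^H factor grows exponentially in the horizon, so the bound is most informative for short horizons, in line with the short-horizon bandit illustration.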
Entities
Institutions
- arXiv