Minimax Optimal Variance-Aware Regret Bounds for Multinomial Logistic MDPs

other · 2026-05-20

A new arXiv paper studies regret bounds for reinforcement learning in episodic Markov Decision Processes (MDPs) with multinomial logistic transitions. The best existing bound, due to Li et al. (2024), is O~(dH^2√T), stated in terms of the dimension d, the episode length H, and the total number of episodes T. The authors introduce a problem-dependent constant, σ̄_T ≤ 1/2, that measures the normalized average variance of the optimal value function over the course of learning. Their proposed algorithm attains regret O~(dH^2σ̄_T√T), which matches the existing bound up to constants in the worst case and improves on it in structured MDPs where σ̄_T is small. In KL-constrained robust MDPs, for example, σ̄_T = O(H^{-1}).
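To make the two ingredients concrete, here is a minimal sketch of a multinomial logistic transition model and of a normalized value-function standard deviation. It assumes the standard softmax parameterization p(s' | s, a) ∝ exp(φ(s, a, s')ᵀθ) and a value function bounded in [0, H]; the function names, the toy features, and the exact normalization are illustrative assumptions, not the paper's definitions. Under these assumptions the normalized quantity cannot exceed 1/2, which is consistent with the stated bound σ̄_T ≤ 1/2.

```python
import math


def mnl_transition_probs(features, theta):
    """Softmax over next states: p(s') proportional to exp(<phi(s, a, s'), theta>).

    `features` holds one feature vector phi(s, a, s') per reachable next state
    (hypothetical toy input, not taken from the paper).
    """
    logits = [sum(f * t for f, t in zip(phi, theta)) for phi in features]
    shift = max(logits)  # subtract the max logit for numerical stability
    weights = [math.exp(z - shift) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


def normalized_value_std(probs, next_values, horizon):
    """Standard deviation of V(s') under `probs`, divided by the horizon H.

    For values in [0, H] the variance is at most H^2 / 4, so the result
    lies in [0, 1/2].
    """
    mean = sum(p * v for p, v in zip(probs, next_values))
    var = sum(p * (v - mean) ** 2 for p, v in zip(probs, next_values))
    return math.sqrt(var) / horizon


# Toy instance (assumed): 3 reachable next states, d = 2 features each.
features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
theta = [0.3, -0.2]
H = 10
probs = mnl_transition_probs(features, theta)
sigma = normalized_value_std(probs, [0.0, 10.0, 4.0], H)

# Worst case: a fair coin flip between value 0 and value H.
worst = normalized_value_std([0.5, 0.5], [0.0, float(H)], H)
```

The worst-case call illustrates where the 1/2 ceiling comes from: the variance of a quantity confined to [0, H] is largest when its mass is split evenly between the two endpoints.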

Key facts

Paper studies reinforcement learning for episodic MDPs with multinomial logistic transitions.
Existing regret bound is O~(dH^2√T) from Li et al. (2024).
Introduces problem-dependent constant σ̄_T ≤ 1/2 measuring normalized average variance.
Proposed algorithm achieves regret O~(dH^2σ̄_T√T).
For KL-constrained robust MDPs, σ̄_T = O(H^{-1}), reducing horizon dependence by √H.
Builds on logistic bandit works by Abeille et al., Faury et al., and Boudart et al.
Published on arXiv with ID 2605.19768.
Announce type is new.
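As a rough illustration of how the two bounds listed above relate, the sketch below evaluates their leading terms with logarithmic factors and constants dropped. The function names and the sample values of d, H, and T are hypothetical; only the two formulas and the constraint σ̄_T ≤ 1/2 come from the summary.

```python
import math


def baseline_bound(d, H, T):
    """Leading term of the existing bound, d * H^2 * sqrt(T) (logs and constants dropped)."""
    return d * H**2 * math.sqrt(T)


def variance_aware_bound(d, H, T, sigma_bar):
    """Leading term of the variance-aware bound, d * H^2 * sigma_bar * sqrt(T)."""
    if not 0.0 <= sigma_bar <= 0.5:
        raise ValueError("sigma_bar is assumed to lie in [0, 1/2]")
    return d * H**2 * sigma_bar * math.sqrt(T)


# Hypothetical problem sizes, chosen only for illustration.
d, H, T = 8, 20, 10_000
old = baseline_bound(d, H, T)
ratios = {s: variance_aware_bound(d, H, T, s) / old for s in (0.5, 0.1, 0.01)}
```

The ratio of the two leading terms is σ̄_T itself, so the gain depends entirely on how small the normalized variance is for the problem at hand.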
