Bilevel Optimization for Zero-Sum Markov Games with the PANDA Algorithm
Researchers propose PANDA, a penalty-based, first-order policy-gradient method for bilevel optimization in which the lower-level problem is a regularized min-max zero-sum Markov game. Existing bilevel reinforcement learning (RL) methods assume that the lower level is a single-policy Markov decision process (MDP); PANDA instead handles the competitive structure that arises in applications such as incentive design. The method uses the Nikaido-Isoda function to avoid computing upper-level hypergradients, and it requires no second-order information. The work thereby extends bilevel RL to settings where the lower level involves multiple interacting policies.
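To make the role of the Nikaido-Isoda function concrete, the sketch below evaluates the standard Nikaido-Isoda gap on an entropy-regularized zero-sum matrix game, which can be viewed as a single-state Markov game. This is an illustration under stated assumptions, not the paper's formulation: the payoff matrix A, the regularization weight tau, the entropy regularizer, and the function names are all hypothetical. The gap is nonnegative and is zero exactly at the saddle point, which is the property a penalty-based method can rely on in place of a hypergradient.

```python
import math

def logsumexp(values):
    # Numerically stable log(sum(exp(v))) over a list of floats.
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def entropy(p):
    # Shannon entropy of a probability vector (0 * log 0 treated as 0).
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def ni_gap(A, x, y, tau):
    """Nikaido-Isoda gap for the assumed regularized game
    f(x, y) = x^T A y - tau * H(x) + tau * H(y),
    where x minimizes and y maximizes.
    gap(x, y) = max_y' f(x, y') - min_x' f(x', y), always >= 0.
    """
    rows, cols = len(A), len(A[0])
    ATx = [sum(A[i][j] * x[i] for i in range(rows)) for j in range(cols)]
    Ay = [sum(A[i][j] * y[j] for j in range(cols)) for i in range(rows)]
    # Closed-form best responses under entropy regularization (softmax form).
    best_max = -tau * entropy(x) + tau * logsumexp([v / tau for v in ATx])
    best_min = tau * entropy(y) - tau * logsumexp([-v / tau for v in Ay])
    return best_max - best_min

# Matching pennies: the uniform strategy pair is the regularized saddle point.
A = [[1.0, -1.0], [-1.0, 1.0]]
uniform = [0.5, 0.5]
gap_at_saddle = ni_gap(A, uniform, uniform, tau=0.1)
gap_off_saddle = ni_gap(A, [0.9, 0.1], uniform, tau=0.1)
```

In a penalty scheme of this kind, the upper-level loss would be augmented by a multiple of this gap, so that first-order updates drive the lower-level policies toward equilibrium; the exact penalty used by PANDA is not stated here.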
Key facts
- Bilevel optimization over saddle points of zero-sum Markov games
- PANDA: penalty-augmented Nikaido-Isoda descent-ascent
- Lower-level problem is a regularized min-max zero-sum Markov game
- Upper-level objective is optimized through the lower-level saddle-point equilibrium
- Avoids computing upper-level hypergradients
- No second-order information required
- Applicable to incentive design
- Published on arXiv with ID 2605.26654
Entities
Institutions
- arXiv