Hybrid Policy Optimization for Discrete-Continuous Action Spaces
A new reinforcement learning method, Hybrid Policy Optimization (HPO), addresses the challenge of hybrid discrete-continuous action spaces common in robotics, control, and operations. Standard model-free policy gradient methods that rely on score-function estimators suffer from credit-assignment issues in high-dimensional settings. Differentiable simulation instead backpropagates through the simulator, but yields biased or uninformative gradients when actions are discrete or the dynamics are non-smooth. HPO combines the two estimators: it backpropagates pathwise gradients through the simulator where smoothness permits and falls back to score-function gradients for discrete actions, keeping the overall estimator unbiased. The method is detailed in arXiv:2605.14297.
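The two estimator families being combined can be illustrated on a toy hybrid action (a categorical choice plus a Gaussian continuous action). The sketch below is not the paper's algorithm; it is a minimal, hypothetical example showing a score-function (REINFORCE) gradient for the discrete logits alongside a pathwise (reparameterization) gradient for the continuous mean, with a made-up quadratic reward standing in for the simulator.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical reward: the discrete action picks a target the
# continuous action should match (differentiable in a_cont).
TARGETS = np.array([-1.0, 2.0])

def reward(a_disc, a_cont):
    return -(a_cont - TARGETS[a_disc]) ** 2

def hybrid_gradient(logits, mu, sigma=0.5, n=20000):
    """Monte Carlo hybrid estimator:
    - score-function gradient w.r.t. the discrete logits,
    - pathwise (reparameterized) gradient w.r.t. the continuous mean."""
    probs = softmax(logits)
    g_logits = np.zeros_like(logits)
    g_mu = 0.0
    for _ in range(n):
        d = rng.choice(len(probs), p=probs)
        eps = rng.standard_normal()
        a_c = mu + sigma * eps              # reparameterized sample
        r = reward(d, a_c)
        # Score-function term: r * grad_logits log pi(d)
        onehot = np.eye(len(probs))[d]
        g_logits += r * (onehot - probs)
        # Pathwise term: dr/da_c * da_c/dmu (the latter is 1)
        g_mu += -2.0 * (a_c - TARGETS[d])
    return g_logits / n, g_mu / n
```

Both pieces are unbiased for their respective parameters; the pathwise term typically has far lower variance, which is the motivation for using it wherever the simulator is differentiable.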
Key facts
- HPO addresses hybrid discrete-continuous action spaces
- Standard score-function estimators suffer from credit-assignment issues
- Differentiable simulation yields biased gradients for discrete actions
- HPO combines pathwise and score-function gradients
- HPO maintains unbiasedness
- Method is described in arXiv:2605.14297
- Applications include robotics, control, and operations
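The "combines pathwise and score-function gradients" fact above admits a generic form. Assuming a factored policy with a categorical discrete head \(\pi_\theta(a_d \mid s)\) and a reparameterized continuous head \(a_c = \mu_\theta(s) + \sigma_\theta(s)\,\varepsilon\), \(\varepsilon \sim \mathcal{N}(0, I)\) (notation introduced here, not taken from the paper), the return gradient splits as

\[
\nabla_\theta \, \mathbb{E}\!\left[ R(a_d, a_c) \right]
= \underbrace{\mathbb{E}\!\left[ R \,\nabla_\theta \log \pi_\theta(a_d \mid s) \right]}_{\text{score-function (discrete)}}
+ \underbrace{\mathbb{E}\!\left[ \frac{\partial R}{\partial a_c} \,\nabla_\theta \big( \mu_\theta(s) + \sigma_\theta(s)\,\varepsilon \big) \right]}_{\text{pathwise (continuous)}} ,
\]

where the second term is what backpropagation through a differentiable simulator computes, and the first remains unbiased even when \(R\) is non-smooth in \(a_d\).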