Hybrid Policy Optimization for Discrete-Continuous Action Spaces
A new reinforcement learning method, Hybrid Policy Optimization (HPO), addresses the challenge of hybrid discrete-continuous action spaces common in robotics, control, and operations. Standard model-free policy gradient methods that rely on score-function estimators suffer from credit-assignment issues in high-dimensional settings. Differentiable simulation instead backpropagates through the simulator, but yields biased or uninformative gradients when actions are discrete or the dynamics are non-smooth. HPO combines the two estimators: it backpropagates pathwise gradients through the simulator where smoothness permits and falls back to score-function gradients for discrete actions, keeping the overall estimator unbiased. The method is detailed in arXiv:2605.14297.
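The two estimator families being combined can be illustrated on a toy hybrid action (a categorical choice plus a Gaussian continuous action). The sketch below is not the paper's algorithm; it is a minimal, hypothetical example showing a score-function (REINFORCE) gradient for the discrete logits alongside a pathwise (reparameterization) gradient for the continuous mean, with a made-up quadratic reward standing in for the simulator.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical reward: the discrete action picks a target the
# continuous action should match (differentiable in a_cont).
TARGETS = np.array([-1.0, 2.0])

def reward(a_disc, a_cont):
    return -(a_cont - TARGETS[a_disc]) ** 2

def hybrid_gradient(logits, mu, sigma=0.5, n=20000):
    """Monte Carlo hybrid estimator:
    - score-function gradient w.r.t. the discrete logits,
    - pathwise (reparameterized) gradient w.r.t. the continuous mean."""
    probs = softmax(logits)
    g_logits = np.zeros_like(logits)
    g_mu = 0.0
    for _ in range(n):
        d = rng.choice(len(probs), p=probs)
        eps = rng.standard_normal()
        a_c = mu + sigma * eps              # reparameterized sample
        r = reward(d, a_c)
        # Score-function term: r * grad_logits log pi(d)
        onehot = np.eye(len(probs))[d]
        g_logits += r * (onehot - probs)
        # Pathwise term: dr/da_c * da_c/dmu (the latter is 1)
        g_mu += -2.0 * (a_c - TARGETS[d])
    return g_logits / n, g_mu / n
```

Both pieces are unbiased for their respective parameters; the pathwise term typically has far lower variance, which is the motivation for using it wherever the simulator is differentiable.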
Key facts
- HPO addresses hybrid discrete-continuous action spaces
- Standard score-function estimators suffer from credit-assignment issues
- Differentiable simulation yields biased gradients for discrete actions
- HPO combines pathwise and score-function gradients
- HPO maintains unbiasedness
- Method is described in arXiv:2605.14297
- Applications include robotics, control, and operations
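The "combines pathwise and score-function gradients" fact above admits a generic form. Assuming a factored policy with a categorical discrete head \(\pi_\theta(a_d \mid s)\) and a reparameterized continuous head \(a_c = \mu_\theta(s) + \sigma_\theta(s)\,\varepsilon\), \(\varepsilon \sim \mathcal{N}(0, I)\) (notation introduced here, not taken from the paper), the return gradient splits as

\[
\nabla_\theta \, \mathbb{E}\!\left[ R(a_d, a_c) \right]
= \underbrace{\mathbb{E}\!\left[ R \,\nabla_\theta \log \pi_\theta(a_d \mid s) \right]}_{\text{score-function (discrete)}}
+ \underbrace{\mathbb{E}\!\left[ \frac{\partial R}{\partial a_c} \,\nabla_\theta \big( \mu_\theta(s) + \sigma_\theta(s)\,\varepsilon \big) \right]}_{\text{pathwise (continuous)}} ,
\]

where the second term is what backpropagation through a differentiable simulator computes, and the first remains unbiased even when \(R\) is non-smooth in \(a_d\).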