EXPO: Stable Reinforcement Learning with Expressive Policies
A new algorithm called Expressive Policy Optimization (EXPO) addresses the challenge of training expressive policies, such as diffusion and flow-matching models, with online reinforcement learning (RL) from offline datasets. Unlike simpler Gaussian policies, expressive policies generate actions through a long denoising chain, which hinders stable gradient propagation during value-based training. Rather than directly optimizing the expressive policy against the value function, EXPO constructs an on-the-fly policy that maximizes Q-value, enabling sample-efficient online RL with two parameterized policies. The research is detailed in arXiv:2507.07986v3.
Key facts
- EXPO stands for Expressive Policy Optimization.
- It is an online RL algorithm for training expressive policies.
- Expressive policies include diffusion and flow-matching models.
- The algorithm constructs an on-the-fly policy whose actions maximize Q-value.
- This avoids directly optimizing the expressive policy against the value function, sidestepping unstable gradient propagation through the denoising chain.
- EXPO is designed to be sample-efficient.
- It utilizes two parameterized policies: a larger expressive base policy and an on-the-fly policy.
- The paper is available on arXiv with ID 2507.07986v3.
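The on-the-fly action-selection idea in the facts above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: it assumes the on-the-fly policy works by sampling candidate actions from the expressive base policy, applying small edits, and returning the candidate with the highest Q-value. All function names (`expo_action_selection`, `base_policy`, `edit_policy`, `q_fn`) are hypothetical stand-ins.

```python
import numpy as np

def expo_action_selection(state, base_policy, edit_policy, q_fn, n_samples=8):
    """Hypothetical EXPO-style on-the-fly action selection.

    Samples candidate actions from an expressive base policy, applies a
    learned edit to each candidate, and returns the candidate with the
    highest Q-value. No gradients flow through the base policy here, which
    mirrors how EXPO avoids direct value optimization over the expressive
    policy (details simplified relative to the paper).
    """
    base_actions = np.stack([base_policy(state) for _ in range(n_samples)])
    edited_actions = np.stack([edit_policy(state, a) for a in base_actions])
    candidates = np.concatenate([base_actions, edited_actions], axis=0)
    q_values = np.array([q_fn(state, a) for a in candidates])
    return candidates[int(np.argmax(q_values))]

# Toy demo with stand-in components (not the paper's models):
rng = np.random.default_rng(0)
state = np.zeros(2)
base_policy = lambda s: rng.normal(size=2)               # stands in for a diffusion/flow policy
edit_policy = lambda s, a: a + 0.1 * rng.normal(size=2)  # small local edit of a base action
q_fn = lambda s, a: -np.sum(a ** 2)                      # toy critic: prefers actions near zero
action = expo_action_selection(state, base_policy, edit_policy, q_fn)
print(action.shape)
```

Because selection is a discrete argmax over sampled candidates, value improvement happens without backpropagating through the base policy's denoising chain.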