EXPO: Stable Reinforcement Learning with Expressive Policies
A new algorithm called Expressive Policy Optimization (EXPO) addresses the challenge of training expressive policies, such as diffusion and flow-matching models, with online reinforcement learning (RL) from offline datasets. Unlike simpler Gaussian policies, expressive policies generate actions through a long denoising chain, which hinders stable gradient propagation during value-based training. Rather than directly optimizing the expressive policy against the value function, EXPO constructs an on-the-fly policy that maximizes Q-value, enabling sample-efficient online RL with two parameterized policies. The research is detailed in arXiv:2507.07986v3.
Key facts
- EXPO stands for Expressive Policy Optimization.
- It is an online RL algorithm for training expressive policies.
- Expressive policies include diffusion and flow-matching models.
- The algorithm constructs an on-the-fly policy whose actions maximize Q-value.
- This avoids directly optimizing the expressive policy against the value function, sidestepping unstable gradient propagation through the denoising chain.
- EXPO is designed to be sample-efficient.
- It utilizes two parameterized policies: a larger expressive base policy and an on-the-fly policy.
- The paper is available on arXiv with ID 2507.07986v3.
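The on-the-fly action-selection idea in the facts above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: it assumes the on-the-fly policy works by sampling candidate actions from the expressive base policy, applying small edits, and returning the candidate with the highest Q-value. All function names (`expo_action_selection`, `base_policy`, `edit_policy`, `q_fn`) are hypothetical stand-ins.

```python
import numpy as np

def expo_action_selection(state, base_policy, edit_policy, q_fn, n_samples=8):
    """Hypothetical EXPO-style on-the-fly action selection.

    Samples candidate actions from an expressive base policy, applies a
    learned edit to each candidate, and returns the candidate with the
    highest Q-value. No gradients flow through the base policy here, which
    mirrors how EXPO avoids direct value optimization over the expressive
    policy (details simplified relative to the paper).
    """
    base_actions = np.stack([base_policy(state) for _ in range(n_samples)])
    edited_actions = np.stack([edit_policy(state, a) for a in base_actions])
    candidates = np.concatenate([base_actions, edited_actions], axis=0)
    q_values = np.array([q_fn(state, a) for a in candidates])
    return candidates[int(np.argmax(q_values))]

# Toy demo with stand-in components (not the paper's models):
rng = np.random.default_rng(0)
state = np.zeros(2)
base_policy = lambda s: rng.normal(size=2)               # stands in for a diffusion/flow policy
edit_policy = lambda s, a: a + 0.1 * rng.normal(size=2)  # small local edit of a base action
q_fn = lambda s, a: -np.sum(a ** 2)                      # toy critic: prefers actions near zero
action = expo_action_selection(state, base_policy, edit_policy, q_fn)
print(action.shape)
```

Because selection is a discrete argmax over sampled candidates, value improvement happens without backpropagating through the base policy's denoising chain.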