ARTFEED — Contemporary Art Intelligence

EXPO: Stable Reinforcement Learning with Expressive Policies

other · 2026-05-01

A new algorithm, Expressive Policy Optimization (EXPO), addresses the challenge of training expressive policies, such as diffusion and flow-matching models, with online reinforcement learning (RL) from offline datasets. Unlike simpler Gaussian policies, expressive policies generate actions through a long denoising chain, which hinders stable gradient propagation. Rather than optimizing the expressive policy directly against the value function, EXPO constructs an on-the-fly policy that maximizes Q-value, enabling sample-efficient online RL with two parameterized policies. The research is detailed in arXiv:2507.07986v3.
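The on-the-fly action selection described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the candidate count, the edit rule, and all function names (`base_sample`, `edit_policy`, `q_value`) are hypothetical stand-ins.

```python
import numpy as np

def on_the_fly_action(base_sample, edit_policy, q_value, state, n_candidates=8):
    """Pick an action by maximizing Q over edited base-policy samples.

    base_sample(state) -> action drawn from the expressive base policy
    edit_policy(state, action) -> small correction toward higher value
    q_value(state, action) -> scalar Q estimate

    Note: Q is only *evaluated* here, never differentiated, so no
    gradient has to flow back through the base policy's denoising chain.
    """
    candidates = []
    for _ in range(n_candidates):
        a = base_sample(state)          # sample from the expressive policy
        a = a + edit_policy(state, a)   # on-the-fly refinement
        candidates.append(a)
    scores = [q_value(state, a) for a in candidates]
    return candidates[int(np.argmax(scores))]

# Toy 1-D illustration with placeholder policies and Q function:
rng = np.random.default_rng(0)
base = lambda s: rng.normal(0.0, 1.0)    # stand-in base policy
edit = lambda s, a: 0.1 * (1.0 - a)      # nudge actions toward a = 1
q = lambda s, a: -(a - 1.0) ** 2         # Q peaks at a = 1
action = on_the_fly_action(base, edit, q, state=None)
```

The key point the sketch illustrates is that the value function acts as a selector over candidate actions rather than as a training signal backpropagated through the long denoising chain.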

Key facts

  • EXPO stands for Expressive Policy Optimization.
  • It is an online RL algorithm for training expressive policies.
  • Expressive policies include diffusion and flow-matching models.
  • The algorithm uses an on-the-fly policy to maximize Q-value.
  • It avoids directly optimizing the expressive policy against the value function.
  • EXPO is designed to be sample-efficient.
  • It utilizes two parameterized policies: a larger expressive base policy and an on-the-fly policy.
  • The paper is available on arXiv with ID 2507.07986v3.

Entities

Institutions

  • arXiv

Sources