Score-Based One-step MeanFlow Policy Optimization

other · 2026-05-25

Score-Based One-step MeanFlow Policy Optimization (SOM) is an actor-critic algorithm for reinforcement learning. It addresses the computational cost of diffusion and flow matching policies, which normally generate an action through several iterative steps, by learning a direct one-step mapping from noise to data. Using score estimation and a probability flow ODE, SOM constructs the target velocity field directly from the Q-function, so training needs no samples from the target distribution. In online reinforcement learning, SOM reports state-of-the-art performance on locomotion tasks while using only a single generation step.
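To make the one-step mapping concrete, here is a minimal sketch of how an average velocity field turns noise into an action in a single update. It is a toy illustration, not SOM itself: the function names `mean_velocity` and `one_step_action`, the point-mass target, and the convention that t = 1 is noise and t = 0 is data are all assumptions made for this example. In practice the average velocity would be a learned network conditioned on the state.

```python
import numpy as np


def mean_velocity(z_t, r, t, target):
    """Closed-form average velocity for a toy flow (hypothetical helper).

    Assumes the data distribution is a point mass at `target` and a straight
    path z_t = (1 - t) * target + t * noise. The instantaneous velocity along
    that path is the constant (noise - target) = (z_t - target) / t, so its
    average over any interval [r, t] is the same value.
    """
    return (z_t - target) / t


def one_step_action(noise, target):
    """Map noise to an action with a single average-velocity update."""
    r, t = 0.0, 1.0  # assumed convention: t = 1 is noise, r = 0 is data
    # Displacement over [r, t] equals (t - r) times the average velocity.
    return noise - (t - r) * mean_velocity(noise, r, t, target)


rng = np.random.default_rng(0)
target = np.array([0.5, -0.25])   # toy "optimal action"
noise = rng.standard_normal(2)    # sample from the noise prior
action = one_step_action(noise, target)
```

In this toy setting the single update recovers the target exactly, because the average velocity is known in closed form; a learned field would only approximate it.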

Key facts

SOM is an actor-critic algorithm for reinforcement learning.
It uses a single-step mapping from noise to data.
The target velocity field is constructed from the Q-function via score estimation and a probability flow ODE.
SOM eliminates the need for samples from the target distribution.
It achieves state-of-the-art performance on locomotion tasks in online RL.
SOM requires only a single generation step.
The method is based on MeanFlow, which learns an average velocity field.
The paper is published on arXiv with ID 2605.23365.
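The quantities named in the facts above can be written out as follows. This is a sketch under stated assumptions, not the paper's own formulation: the average velocity and one-step update follow the standard MeanFlow definition, the probability flow ODE is the standard one for a forward SDE with drift f and diffusion g, and the Boltzmann form of the target policy with temperature \alpha is a common choice assumed here only to show how a score can come from a Q-function.

```latex
% Average velocity over [r, t] of an instantaneous velocity field v (MeanFlow definition)
u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau

% One-step generation: jump from noise (t = 1) to data (r = 0)
z_0 = z_1 - u(z_1, 0, 1)

% Probability flow ODE for a forward SDE  dz = f(z, t)\,dt + g(t)\,dw
\frac{dz}{dt} = f(z, t) - \tfrac{1}{2}\, g(t)^2\, \nabla_z \log p_t(z)

% Assumed Boltzmann target policy: its score is a scaled action-gradient of Q
\pi(a \mid s) \propto \exp\!\big(Q(s, a) / \alpha\big)
\quad\Longrightarrow\quad
\nabla_a \log \pi(a \mid s) = \tfrac{1}{\alpha}\, \nabla_a Q(s, a)
```

Under the Boltzmann assumption the score depends only on the gradient of Q, because the normalizing constant does not depend on the action. That is consistent with the claim that no samples from the target distribution are needed.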
