Score-Based One-step MeanFlow Policy Optimization
Score-Based One-step MeanFlow Policy Optimization (SOM) is an actor-critic reinforcement learning algorithm. It addresses the computational cost of diffusion and flow matching policies, which generate samples through many iterative steps, by learning a direct one-step mapping from noise to data. Using score estimation and a probability flow ODE, SOM constructs the target velocity field directly from the Q-function, so no samples from the target distribution are required. In online reinforcement learning, SOM achieves state-of-the-art performance on locomotion tasks with a single generation step.
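To make the idea of obtaining a score from the Q-function concrete, here is a minimal sketch. It assumes a Boltzmann-style target policy proportional to exp(Q(s, a) / alpha), whose score with respect to the action is the gradient of Q divided by alpha. This assumption, the toy quadratic critic, the temperature `ALPHA`, and the helper names `q_value` and `score_from_q` are all illustrative and are not taken from the paper, whose exact construction may differ.

```python
import numpy as np

ALPHA = 0.5  # hypothetical temperature of the assumed Boltzmann target policy


def q_value(action, mean):
    """Toy quadratic critic (an assumption): Q(s, a) = -0.5 * ||a - mean(s)||^2."""
    diff = action - mean
    return -0.5 * float(diff @ diff)


def score_from_q(action, mean, alpha=ALPHA, eps=1e-5):
    """Central-difference estimate of grad_a Q(s, a) / alpha.

    Under the assumed target policy pi(a|s) proportional to exp(Q(s, a) / alpha),
    this is the score grad_a log pi(a|s); the normalizing constant drops out.
    """
    grad = np.zeros_like(action)
    for i in range(action.size):
        step = np.zeros_like(action)
        step[i] = eps
        grad[i] = (q_value(action + step, mean) - q_value(action - step, mean)) / (2 * eps)
    return grad / alpha


mean = np.array([0.3, -0.2])    # hypothetical state-dependent optimum of the toy critic
action = np.array([1.0, 0.5])   # action at which the score is evaluated
score = score_from_q(action, mean)

# For this toy critic the target policy is Gaussian with covariance alpha * I,
# so the score is also available in closed form.
analytic = -(action - mean) / ALPHA
```

The point of the sketch is that the score only needs gradients of the critic, not samples from the target policy. In practice an automatic-differentiation framework would replace the finite-difference loop.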
Key facts
- SOM is an actor-critic algorithm for reinforcement learning.
- It uses a single-step mapping from noise to data.
- The target velocity field is constructed from the Q-function via score estimation and a probability flow ODE.
- SOM eliminates the need for samples from the target distribution.
- It achieves state-of-the-art performance on locomotion tasks in online RL.
- SOM requires only a single generation step.
- The method is based on MeanFlow, which learns an average velocity field.
- The paper is available on arXiv under ID 2605.23365.
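The single-step mapping and the average velocity field mentioned in the list above can be illustrated with a toy case. The sketch assumes the usual MeanFlow convention, where t = 1 is noise, t = 0 is data, and the average velocity u(z_t, r, t) is the displacement (z_t - z_r) / (t - r) along the flow, so one generation step is z_0 = z_1 - u(z_1, 0, 1). The point-mass target, the closed-form `average_velocity`, and the function names are hypothetical stand-ins for a learned network and are not taken from the paper.

```python
import numpy as np


def average_velocity(z_t, r, t, target):
    """Closed-form average velocity for a point-mass target (toy assumption).

    With straight paths z_t = (1 - t) * target + t * noise, the velocity is
    constant along each path, so the average over [r, t] equals
    (z_t - target) / t and does not depend on r.
    """
    return (z_t - target) / t


def one_step_sample(noise, target):
    """One generation step: z_0 = z_1 - u(z_1, r=0, t=1)."""
    return noise - average_velocity(noise, 0.0, 1.0, target)


def euler_sample(noise, target, steps=50):
    """Multi-step baseline: integrate the instantaneous velocity from t=1 to t=0."""
    z = noise.copy()
    ts = np.linspace(1.0, 0.0, steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        v = (z - target) / t_cur           # instantaneous velocity at (z, t_cur)
        z = z - (t_cur - t_next) * v       # Euler step toward t = 0
    return z


rng = np.random.default_rng(0)
target = np.array([0.7, -0.4])             # hypothetical optimal action
noise = rng.standard_normal(2)             # z_1 drawn from the noise distribution

z_one_step = one_step_sample(noise, target)
z_multi_step = euler_sample(noise, target)
```

In this toy setting a single evaluation of the average velocity reaches the same endpoint as the 50-step integration of the instantaneous velocity, which is the property a one-step policy relies on. A learned average velocity network would only approximate this.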
Entities
Institutions
- arXiv