ARTFEED — Contemporary Art Intelligence

TMPO: Trajectory Matching Policy Optimization for Diffusion Alignment

other · 2026-05-13

A new reinforcement learning method, Trajectory Matching Policy Optimization (TMPO), addresses reward hacking in diffusion model alignment. Unlike existing RL approaches, which maximize expected scalar reward and thereby drive mode collapse, TMPO matches trajectory-level reward distributions using a Softmax Trajectory Balance objective. This objective trains the policy's probability distribution over K sampled trajectories to align with a reward-induced Boltzmann distribution, preserving generative diversity. The method is presented in a paper on arXiv (2605.10983) and targets downstream tasks where diverse outputs are critical.
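Read literally, that matching step admits a compact formalization. The following is a sketch under assumptions: the temperature beta, the normalization of the policy likelihoods over the K trajectories, and the use of forward KL as the discrepancy measure are inferred from this summary, not taken from the paper.

```latex
% Assumed formalization: q is the reward-induced Boltzmann target over K
% sampled trajectories, p is the policy's normalized probability over the
% same K trajectories, and beta is an assumed temperature.
q_i = \frac{\exp\big(R(\tau_i)/\beta\big)}{\sum_{j=1}^{K} \exp\big(R(\tau_j)/\beta\big)},
\qquad
p_i = \frac{\pi_\theta(\tau_i)}{\sum_{j=1}^{K} \pi_\theta(\tau_j)},
\qquad
\mathcal{L}_{\text{Softmax-TB}} = D_{\mathrm{KL}}(q \,\|\, p) = \sum_{i=1}^{K} q_i \log \frac{q_i}{p_i}.
```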

Key facts

  • TMPO replaces scalar reward maximization with trajectory-level reward distribution matching.
  • It introduces a Softmax Trajectory Balance (Softmax-TB) objective.
  • The objective matches policy probabilities of K trajectories to a reward-induced Boltzmann distribution (see the code sketch after this list).
  • TMPO aims to prevent mode collapse and reward hacking in diffusion models.
  • The paper is available on arXiv with ID 2605.10983.
  • The method is designed for aligning diffusion models to downstream tasks.
  • Existing RL methods suffer from mode-seeking behavior that reduces diversity.
  • TMPO inherits the mode-covering property of forward KL divergence.
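
To make the matching objective above concrete, here is a minimal PyTorch sketch of a Softmax-TB-style loss over K sampled trajectories. It is an illustration under the same assumptions as the equations above; the function name softmax_tb_loss, the temperature beta, and the detached target are hypothetical choices, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def softmax_tb_loss(traj_log_probs: torch.Tensor,
                    traj_rewards: torch.Tensor,
                    beta: float = 1.0) -> torch.Tensor:
    """Sketch of a Softmax Trajectory Balance style loss (names assumed).

    traj_log_probs: shape (K,), summed log-probability log pi_theta(tau_i)
                    of each sampled denoising trajectory.
    traj_rewards:   shape (K,), scalar reward R(tau_i) per trajectory.
    beta:           assumed temperature of the reward-induced Boltzmann target.
    """
    # Reward-induced Boltzmann target q over the K trajectories.
    target = F.softmax(traj_rewards / beta, dim=0).detach()
    # Policy's normalized log-distribution log p over the same K trajectories.
    log_policy = F.log_softmax(traj_log_probs, dim=0)
    # Forward KL(q || p): mode-covering, so the policy is penalized for
    # assigning low probability to any trajectory the target weights.
    return torch.sum(target * (torch.log(target + 1e-12) - log_policy))

# Example with K = 4 trajectories (dummy values).
log_probs = torch.tensor([-120.3, -118.7, -119.5, -121.0], requires_grad=True)
rewards = torch.tensor([0.8, 0.2, 0.5, 0.9])
loss = softmax_tb_loss(log_probs, rewards, beta=0.5)
loss.backward()  # gradients flow only through the policy log-probabilities
```

Detaching the target keeps the Boltzmann weights fixed during backpropagation, so gradients flow only through the policy's log-likelihoods; since the target is constant with respect to the policy parameters, minimizing this forward KL is equivalent to minimizing the cross-entropy between the target and policy weights.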

Entities

Institutions

  • arXiv
