Maximum Entropy Adjoint Matching Improves Offline RL Policy Optimization
A new paper on arXiv (2605.06156) proposes Maximum Entropy Adjoint Matching (ME-AM), a framework that addresses limitations of offline reinforcement learning with flow-matching policies. The existing baseline, Q-learning with Adjoint Matching (QAM), suffers from popularity bias and support binding, which respectively suppress high-reward actions in low-density regions and restrict exploration off the data manifold. ME-AM incorporates Mirror Descent entropy maximization directly into the continuous flow formulation, overcoming both issues in a unified way without falling back on residual Gaussian policies, which reintroduce expressivity bottlenecks.
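For context, maximum-entropy reinforcement learning augments expected return with a policy-entropy bonus; a standard textbook form of this objective (background only, not necessarily the paper's exact ME-AM objective) is

$$
J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t}\gamma^{t}\Big(r(s_t,a_t) + \alpha\,\mathcal{H}\big(\pi(\cdot\mid s_t)\big)\Big)\right],
$$

where the temperature \(\alpha\) trades off reward against entropy and \(\mathcal{H}\) denotes the policy's entropy at each state.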
Key facts
- Paper arXiv:2605.06156 proposes Maximum Entropy Adjoint Matching (ME-AM)
- ME-AM addresses popularity bias and support binding in offline RL
- Q-learning with Adjoint Matching (QAM) is the baseline method
- ME-AM uses Mirror Descent entropy maximization
- The framework operates within a continuous flow formulation (see the sketch after this list)
- Residual Gaussian policies reintroduce expressivity bottlenecks
- ME-AM offers a unified solution to QAM's limitations
- The paper appears as a cross-listing announcement on arXiv
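To make the continuous flow formulation concrete, the sketch below shows how a flow-matching policy typically produces an action: a learned velocity field is integrated from Gaussian noise at t=0 to an action at t=1. This is a generic illustration under assumed names and dimensions (VelocityField, sample_action, state_dim=17, action_dim=6), not the paper's implementation and not the QAM/ME-AM training procedure.

```python
# Minimal sketch (not the paper's code): drawing an action from a
# flow-matching policy by Euler integration of a learned velocity field.
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """Hypothetical velocity network v_theta(x_t, t, s) conditioned on the state."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, x, t, state):
        return self.net(torch.cat([x, t, state], dim=-1))

@torch.no_grad()
def sample_action(vf: VelocityField, state: torch.Tensor, steps: int = 20):
    """Integrate dx/dt = v_theta(x, t, s) from t=0 (noise) to t=1 (action)."""
    batch = state.shape[0]
    action_dim = vf.net[-1].out_features
    x = torch.randn(batch, action_dim)   # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = torch.full((batch, 1), k * dt)
        x = x + dt * vf(x, t, state)      # explicit Euler step
    return x

# Usage with hypothetical dimensions.
vf = VelocityField(state_dim=17, action_dim=6)
state = torch.randn(4, 17)
actions = sample_action(vf, state)
print(actions.shape)  # torch.Size([4, 6])
```

An explicit Euler integrator is the simplest choice here; more steps or a higher-order solver trades extra compute for closer adherence to the learned flow.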
Entities
Institutions
- arXiv