Maximum Entropy Adjoint Matching Improves Offline RL Policy Optimization
A new paper on arXiv (2605.06156) proposes Maximum Entropy Adjoint Matching (ME-AM), a framework that addresses limitations of offline reinforcement learning with flow-matching policies. The existing baseline, Q-learning with Adjoint Matching (QAM), suffers from popularity bias and support binding, which respectively suppress high-reward actions in low-density regions and restrict exploration off the data manifold. ME-AM incorporates Mirror Descent entropy maximization directly into the continuous flow formulation, overcoming both issues in a unified way without falling back on residual Gaussian policies, which reintroduce expressivity bottlenecks.
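For context, maximum-entropy reinforcement learning augments expected return with a policy-entropy bonus; a standard textbook form of this objective (background only, not necessarily the paper's exact ME-AM objective) is

$$
J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t}\gamma^{t}\Big(r(s_t,a_t) + \alpha\,\mathcal{H}\big(\pi(\cdot\mid s_t)\big)\Big)\right],
$$

where the temperature \(\alpha\) trades off reward against entropy and \(\mathcal{H}\) denotes the policy's entropy at each state.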
Key facts
- Paper arXiv:2605.06156 proposes Maximum Entropy Adjoint Matching (ME-AM)
- ME-AM addresses popularity bias and support binding in offline RL
- Q-learning with Adjoint Matching (QAM) is the baseline method
- ME-AM uses Mirror Descent entropy maximization
- The framework operates within a continuous flow formulation (see the sketch after this list)
- Residual Gaussian policies reintroduce expressivity bottlenecks
- ME-AM offers a unified solution to QAM's limitations
- The paper appears as a cross-listing announcement on arXiv
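To make the continuous flow formulation concrete, the sketch below shows how a flow-matching policy typically produces an action: a learned velocity field is integrated from Gaussian noise at t=0 to an action at t=1. This is a generic illustration under assumed names and dimensions (VelocityField, sample_action, state_dim=17, action_dim=6), not the paper's implementation and not the QAM/ME-AM training procedure.

```python
# Minimal sketch (not the paper's code): drawing an action from a
# flow-matching policy by Euler integration of a learned velocity field.
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """Hypothetical velocity network v_theta(x_t, t, s) conditioned on the state."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, x, t, state):
        return self.net(torch.cat([x, t, state], dim=-1))

@torch.no_grad()
def sample_action(vf: VelocityField, state: torch.Tensor, steps: int = 20):
    """Integrate dx/dt = v_theta(x, t, s) from t=0 (noise) to t=1 (action)."""
    batch = state.shape[0]
    action_dim = vf.net[-1].out_features
    x = torch.randn(batch, action_dim)   # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = torch.full((batch, 1), k * dt)
        x = x + dt * vf(x, t, state)      # explicit Euler step
    return x

# Usage with hypothetical dimensions.
vf = VelocityField(state_dim=17, action_dim=6)
state = torch.randn(4, 17)
actions = sample_action(vf, state)
print(actions.shape)  # torch.Size([4, 6])
```

An explicit Euler integrator is the simplest choice here; more steps or a higher-order solver trades extra compute for closer adherence to the learned flow.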
Entities
Institutions
- arXiv