ARTFEED — Contemporary Art Intelligence

Maximum Entropy Adjoint Matching Improves Offline RL Policy Optimization

publication · 2026-05-09

A new paper on arXiv (arXiv:2605.06156) proposes Maximum Entropy Adjoint Matching (ME-AM), a framework that addresses limitations of offline reinforcement learning with flow-matching policies. The existing Q-learning with Adjoint Matching (QAM) approach suffers from popularity bias and support binding: it suppresses high-reward actions that lie in low-density regions of the behavior data, and it restricts exploration off the data manifold. ME-AM incorporates Mirror Descent entropy maximization directly within the continuous flow formulation, resolving both issues in a unified way without reintroducing the expressivity bottlenecks of residual Gaussian policies.
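
For orientation: the digest does not reproduce ME-AM's objective, but the maximum-entropy RL family it belongs to generally augments the expected return with a policy-entropy bonus, as in the standard objective

    J(\pi) = \mathbb{E}_{\pi}\Big[\textstyle\sum_{t}\gamma^{t}\big(r(s_t, a_t) + \alpha\,\mathcal{H}(\pi(\cdot \mid s_t))\big)\Big]

where \alpha trades off reward against entropy. The entropy term is what keeps probability mass on rarely seen but high-value actions, which is the failure mode popularity bias describes.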

Key facts

  • Paper arXiv:2605.06156 proposes Maximum Entropy Adjoint Matching (ME-AM)
  • ME-AM addresses popularity bias and support binding in offline RL
  • Q-learning with Adjoint Matching (QAM) is the baseline method
  • ME-AM uses Mirror Descent entropy maximization (a discrete-analogue sketch follows this list)
  • The framework operates within continuous flow formulation
  • Residual Gaussian policies reintroduce expressivity bottlenecks
  • ME-AM addresses both QAM limitations within a single continuous-flow framework
  • The paper appears as a cross-listed announcement on arXiv
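
The update rule itself is not given in this digest. As a loose intuition only, the sketch below shows the textbook KL-mirror-descent step on an entropy-regularized objective over a discrete action set; the closed form, function name, and constants are illustrative assumptions, not ME-AM's actual continuous flow-matching procedure.

    import numpy as np

    def mirror_descent_step(pi, q, eta=0.5, alpha=0.1):
        """One KL-mirror-descent step on an entropy-regularized objective.

        Maximizes <q, pi> + alpha * H(pi) - (1/eta) * KL(pi || pi_prev)
        over the probability simplex. Discrete analogue only; ME-AM itself
        works with continuous flow-matching policies (arXiv:2605.06156).
        """
        # Closed-form maximizer:
        #   pi' ∝ pi_prev^(1/(1 + alpha*eta)) * exp(eta*q / (1 + alpha*eta))
        logits = (np.log(pi) + eta * q) / (1.0 + alpha * eta)
        logits -= logits.max()          # numerical stability
        new_pi = np.exp(logits)
        return new_pi / new_pi.sum()

    # Toy run: the entropy term flattens the behavior prior's influence,
    # letting mass move onto a rare but high-value action.
    pi = np.array([0.70, 0.25, 0.05])   # behavior-biased ("popular") policy
    q = np.array([0.0, 0.1, 1.0])       # the rare action has the highest value
    for _ in range(10):
        pi = mirror_descent_step(pi, q)
    print(pi)                           # most mass now sits on the rare action

Note the exponent 1/(1 + alpha*eta) < 1 on the prior: each step shrinks the behavior policy's pull, which is precisely the kind of correction to popularity bias an entropy term provides.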

Entities

Institutions

  • arXiv

Sources