ARTFEED — Contemporary Art Intelligence

Trust Region Q-Adjoint Matching for Stable RL Fine-Tuning

other · 2026-05-27

Researchers have introduced Trust Region Q-Adjoint Matching (TRQAM), a stable off-policy fine-tuning method for pretrained flow policies. Off-policy reinforcement learning of these policies is difficult because multi-step sampling makes optimization unstable. Q-learning with Adjoint Matching (QAM) addressed this by recasting the task as a memoryless stochastic optimal control (SOC) problem with a learned critic. QAM is fragile, however: small critic errors are amplified when the critic is ill-conditioned, which can lead to model failure. TRQAM adaptively controls the path-space KL divergence from the pretrained flow policy via projected dual descent. It optimizes the trust-region parameter λ in the SOC dynamics and shows that the path-space KL can be expressed as a closed-form function of λ, which allows precise control. The study is available on arXiv under ID 2605.27079.
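To make the idea of projected dual descent on a trust-region parameter concrete, here is a minimal sketch on a toy problem. It is not the paper's algorithm or its closed-form expression. It assumes a one-dimensional Gaussian base policy N(0, sigma^2) and a linear critic Q(a) = grad * a, for which the KL-regularized optimum is N(sigma^2 * grad / lam, sigma^2) and the KL to the base policy is sigma^2 * grad^2 / (2 * lam^2). The names kl_closed_form, projected_dual_descent, grad, sigma, eps, step and lam_min are all hypothetical, and the paper may parameterize λ differently.

```python
import math


def kl_closed_form(lam, grad, sigma):
    """Toy closed-form KL(pi_lam || pi_0) as a function of lam.

    Assumption (not from the paper): base policy N(0, sigma^2) and a
    linear critic Q(a) = grad * a, so pi_lam = N(sigma^2 * grad / lam, sigma^2).
    """
    return (sigma * grad) ** 2 / (2.0 * lam ** 2)


def projected_dual_descent(grad, sigma, eps, lam=1.0, step=2.0,
                           lam_min=1e-3, iters=500):
    """Adjust lam until the toy KL matches the trust-region budget eps."""
    for _ in range(iters):
        # Positive when the policy has drifted past the KL budget.
        violation = kl_closed_form(lam, grad, sigma) - eps
        # Dual descent step: raise lam when KL is too large, lower it otherwise.
        # The max(...) is the projection that keeps lam strictly positive.
        lam = max(lam_min, lam + step * violation)
    return lam


lam_star = projected_dual_descent(grad=2.0, sigma=1.0, eps=0.1)
kl_at_lam_star = kl_closed_form(lam_star, grad=2.0, sigma=1.0)
```

In this toy setting the fixed point can be checked by hand: setting the KL equal to eps gives lam = sigma * |grad| / sqrt(2 * eps). Because the KL is available in closed form, each update needs no sampling, which is the property the summary credits for precise control.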

Key facts

  • TRQAM is a stable off-policy fine-tuning algorithm for pretrained flow policies.
  • Off-policy RL of flow policies is challenging due to multi-step sampling instability.
  • QAM reformulates the problem as a memoryless stochastic optimal control (SOC) problem with a learned critic.
  • QAM suffers from fragility: small critic errors amplify when critics are ill-conditioned.
  • TRQAM adaptively controls path-space KL via projected dual descent.
  • The trust-region parameter λ is optimized in the SOC dynamics.
  • Path-space KL is represented as a closed-form function of λ.
  • Paper ID: arXiv:2605.27079.

Entities

Institutions

  • arXiv

Sources