Trust Region Q-Adjoint Matching for Stable RL Fine-Tuning

other · 2026-05-27

Researchers have introduced Trust Region Q-Adjoint Matching (TRQAM), a stable off-policy fine-tuning method for pretrained flow policies. Off-policy reinforcement learning of such policies is difficult because multi-step sampling makes optimization unstable. Q-learning with Adjoint Matching (QAM) addressed this by recasting fine-tuning as a memoryless stochastic optimal control (SOC) problem driven by a learned critic. QAM is fragile, however: small critic errors are amplified when the critic is ill-conditioned, leading to model failure. TRQAM adaptively controls the path-space KL divergence relative to the pretrained flow policy via projected dual descent. It optimizes the trust-region parameter λ in the SOC dynamics and shows that the path-space KL can be written as a closed-form function of λ, which allows precise control. The paper is available on arXiv as arXiv:2605.27079.

Key facts

TRQAM is a stable off-policy fine-tuning algorithm for pretrained flow policies.
Off-policy RL of flow policies is challenging due to multi-step sampling instability.
QAM reformulates the problem into a memoryless SOC problem with a learned critic.
QAM suffers from fragility: small critic errors amplify when critics are ill-conditioned.
TRQAM adaptively controls path-space KL via projected dual descent.
The trust-region parameter λ is optimized in SOC dynamics.
Path-space KL is represented as a closed-form function of λ.
Paper ID: arXiv:2605.27079.
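
To make the idea of controlling the path-space KL through λ more concrete, here is a minimal Python sketch of a projected dual update. It is an illustration under stated assumptions, not the paper's algorithm: the summary does not give the closed-form expression, so `path_kl` uses a made-up placeholder (KL = scale / λ²) that merely decreases as λ grows, and the names `kl_budget`, `step_size`, and `lam_min`, along with the sign convention of the update, are hypothetical.

```python
# Illustrative sketch only. The KL(lambda) form, all names, and the sign
# convention are assumptions; the paper's actual closed form is not given here.


def path_kl(lam: float, scale: float = 4.0) -> float:
    """Hypothetical closed-form path-space KL, decreasing in lam (placeholder)."""
    return scale / (lam * lam)


def projected_dual_step(
    lam: float,
    kl_budget: float,
    step_size: float = 0.5,
    lam_min: float = 1e-3,
) -> float:
    """One projected update of the trust-region parameter lam."""
    # Positive when the KL exceeds the budget, negative when it is under budget.
    violation = path_kl(lam) - kl_budget
    # Raise lam when over budget (tighter trust region), lower it otherwise.
    lam_new = lam + step_size * violation
    # Projection step: keep lam inside the feasible set [lam_min, infinity).
    return max(lam_min, lam_new)


def solve_lambda(kl_budget: float, lam_init: float = 1.0, num_steps: int = 200) -> float:
    """Iterate the projected update from lam_init for a fixed number of steps."""
    lam = lam_init
    for _ in range(num_steps):
        lam = projected_dual_step(lam, kl_budget)
    return lam


if __name__ == "__main__":
    lam = solve_lambda(kl_budget=1.0)
    print(f"lambda = {lam:.6f}, KL(lambda) = {path_kl(lam):.6f}")
```

With this placeholder, the KL budget is met exactly when λ² = scale / kl_budget, so the loop settles where the constraint is tight. Because the KL is available as a function of λ, each step can evaluate the constraint directly rather than estimating it from samples, which is the kind of precise control the summary describes.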
