Trust Region Q-Adjoint Matching for Stable RL Fine-Tuning
Researchers have introduced Trust Region Q-Adjoint Matching (TRQAM), a stable off-policy fine-tuning method for pretrained flow policies. Off-policy reinforcement learning for these policies is difficult because multi-step sampling makes optimization unstable. Q-learning with Adjoint Matching (QAM) addressed this by recasting fine-tuning as a memoryless stochastic optimal control (SOC) problem driven by a learned critic. QAM is fragile, however: small critic errors are amplified when the critic is ill-conditioned, which can cause the model to fail. TRQAM adaptively controls the path-space KL divergence from the pretrained flow policy via projected dual descent. It optimizes the trust-region parameter λ in the SOC dynamics and shows that the path-space KL can be written as a closed-form function of λ, which allows precise control. The paper is available on arXiv under ID 2605.27079.
Key facts
- TRQAM is a stable off-policy fine-tuning algorithm for pretrained flow policies.
- Off-policy RL of flow policies is challenging due to multi-step sampling instability.
- QAM reformulates the problem into a memoryless SOC problem with a learned critic.
- QAM suffers from fragility: small critic errors amplify when critics are ill-conditioned.
- TRQAM adaptively controls path-space KL via projected dual descent.
- The trust-region parameter λ is optimized in SOC dynamics.
- Path-space KL is represented as a closed-form function of λ.
- Paper ID: arXiv:2605.27079.
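The projected dual descent idea in the facts above can be illustrated with a toy sketch. This is not the paper's algorithm or its closed-form expression, neither of which is given here. It assumes a one-dimensional Gaussian reference policy N(0, sigma^2) and a linear critic with slope grad_norm, for which the KL-regularized optimum is N(sigma^2 * grad_norm / λ, sigma^2) and the KL to the reference is (sigma * grad_norm)^2 / (2 λ^2). The names toy_path_kl, projected_dual_descent, epsilon, step_size, and lam_min are hypothetical and chosen only for illustration.

```python
def toy_path_kl(lam, grad_norm=2.0, sigma=1.0):
    """KL between the tilted and reference Gaussians in the toy setting.

    Assumption: this stands in for the closed-form path-space KL as a
    function of lambda; the paper's actual expression is not reproduced here.
    """
    return (sigma * grad_norm) ** 2 / (2.0 * lam ** 2)


def projected_dual_descent(kl_fn, epsilon, lam_init=1.0, step_size=0.5,
                           lam_min=1e-3, num_steps=200):
    """Adjust lambda so that kl_fn(lambda) approaches the budget epsilon.

    The dual gradient with respect to lambda is (epsilon - KL(lambda)).
    Each step descends along it, then projects lambda back onto
    [lam_min, infinity) so the regularization weight stays positive.
    """
    lam = lam_init
    for _ in range(num_steps):
        dual_grad = epsilon - kl_fn(lam)
        # Descent step followed by projection onto the feasible set.
        lam = max(lam_min, lam - step_size * dual_grad)
    return lam


if __name__ == "__main__":
    lam = projected_dual_descent(toy_path_kl, epsilon=0.5)
    print(f"lambda = {lam:.4f}, KL = {toy_path_kl(lam):.4f}")
```

With the default toy values the KL is 2 / λ^2, so a budget of 0.5 corresponds to λ = 2, which the iteration approaches. In this toy setting a fixed λ would let the KL grow with the square of the critic slope, whereas the adaptive update raises λ whenever the KL exceeds the budget.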
Entities
Institutions
- arXiv