Reward Weighted Classifier-Free Guidance Enables Test-Time Optimization in Autoregressive Models
Reward weighted classifier-free guidance (RCFG) is a new technique that acts as a policy improvement operator for autoregressive models, enabling reward functions to be optimized without retraining. The method approximates tilting the sampling distribution by the Q function, sidestepping a drawback of traditional reinforcement learning: whenever the reward function changes, the policy normally has to be retrained from scratch. RCFG has been applied to molecular generation, demonstrating that it can optimize novel reward functions at test time. In addition, using RCFG as a teacher and distilling it into the base policy provides a useful warm start. The work targets outputs that can be summarized by attribute vectors, such as helpfulness versus harmlessness or bioavailability versus lipophilicity, with arbitrary reward functions encoding trade-offs between these attributes. The research is detailed in arXiv preprint 2604.15577v1, noted as a cross-listed submission.
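The core operation described above is tilting the base model's next-token distribution by the Q function. The following is a minimal sketch of that tilt in PyTorch; the function name rcfg_next_token_logits, the beta temperature, and the assumption that a Q estimate is available for every candidate token are illustrative choices, not details taken from the paper.

```python
import torch

def rcfg_next_token_logits(base_logits: torch.Tensor,
                           q_values: torch.Tensor,
                           beta: float = 1.0) -> torch.Tensor:
    """Tilt the base policy's next-token distribution by exp(beta * Q).

    base_logits: (vocab_size,) logits from the frozen autoregressive model.
    q_values:    (vocab_size,) estimated expected reward Q(s, a) for each
                 candidate next token a, given the current prefix s.
    Returns logits whose softmax is pi'(a|s) proportional to
    pi(a|s) * exp(beta * Q(s, a)), so swapping in a new reward (and hence
    a new Q) requires no retraining of the base model.
    """
    return base_logits + beta * q_values

# Usage: sample one token from the tilted distribution.
base_logits = torch.randn(32_000)   # stand-in for the model's output logits
q_values = torch.randn(32_000)      # stand-in for a per-token Q estimator
probs = torch.softmax(
    rcfg_next_token_logits(base_logits, q_values, beta=2.0), dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
```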
Key facts
- Reward weighted classifier-free guidance (RCFG) acts as a policy improvement operator in autoregressive models
- RCFG approximates tilting the sampling distribution by the Q function
- The method enables optimization of novel reward functions at test time without retraining
- Applied successfully to molecular generation tasks
- Using RCFG as a teacher and distilling into the base policy serves as a warm start (see the distillation sketch after this list)
- Traditional reinforcement learning requires retraining when reward functions change
- Autoregressive models produce outputs summarized by attribute vectors
- Arbitrary reward functions encode tradeoffs between properties like helpfulness vs. harmlessness
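The warm-start idea in the list above can be expressed as ordinary distillation: treat the RCFG-tilted distribution as a fixed teacher and minimize a KL divergence against the trainable base policy. Below is a minimal PyTorch sketch under that assumption; distill_step and the per-token Q estimates are hypothetical, and the paper's actual training procedure may differ.

```python
import torch
import torch.nn.functional as F

def distill_step(student_logits: torch.Tensor,
                 base_logits: torch.Tensor,
                 q_values: torch.Tensor,
                 beta: float,
                 optimizer: torch.optim.Optimizer) -> float:
    """One distillation step pulling the student toward the RCFG teacher.

    All logit tensors have shape (batch, vocab_size); student_logits must
    carry gradients, while the teacher targets are detached.
    """
    with torch.no_grad():
        # Teacher: the base policy tilted by exp(beta * Q), as in RCFG.
        teacher_probs = F.softmax(base_logits + beta * q_values, dim=-1)
    # KL(teacher || student), averaged over the batch.
    loss = F.kl_div(F.log_softmax(student_logits, dim=-1),
                    teacher_probs, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Distilling the tilted distribution back into the base policy means later guidance starts from a policy already biased toward high reward, which is what makes it a warm start.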
Entities
Institutions
- arXiv