Reward Weighted Classifier-Free Guidance Enables Test-Time Optimization in Autoregressive Models
Reward weighted classifier-free guidance (RCFG) is a new technique that acts as a policy improvement operator for autoregressive models, enabling reward functions to be optimized without retraining. The method approximates tilting the sampling distribution by the Q function, sidestepping a drawback of traditional reinforcement learning: whenever the reward function changes, the policy normally has to be retrained from scratch. RCFG has been applied to molecular generation, demonstrating that it can optimize novel reward functions at test time. In addition, using RCFG as a teacher and distilling it into the base policy provides a useful warm start. The work targets outputs that can be summarized by attribute vectors, such as helpfulness versus harmlessness or bioavailability versus lipophilicity, with arbitrary reward functions encoding trade-offs between these attributes. The research is detailed in arXiv preprint 2604.15577v1, noted as a cross-listed submission.
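The core operation described above is tilting the base model's next-token distribution by the Q function. The following is a minimal sketch of that tilt in PyTorch; the function name rcfg_next_token_logits, the beta temperature, and the assumption that a Q estimate is available for every candidate token are illustrative choices, not details taken from the paper.

```python
import torch

def rcfg_next_token_logits(base_logits: torch.Tensor,
                           q_values: torch.Tensor,
                           beta: float = 1.0) -> torch.Tensor:
    """Tilt the base policy's next-token distribution by exp(beta * Q).

    base_logits: (vocab_size,) logits from the frozen autoregressive model.
    q_values:    (vocab_size,) estimated expected reward Q(s, a) for each
                 candidate next token a, given the current prefix s.
    Returns logits whose softmax is pi'(a|s) proportional to
    pi(a|s) * exp(beta * Q(s, a)), so swapping in a new reward (and hence
    a new Q) requires no retraining of the base model.
    """
    return base_logits + beta * q_values

# Usage: sample one token from the tilted distribution.
base_logits = torch.randn(32_000)   # stand-in for the model's output logits
q_values = torch.randn(32_000)      # stand-in for a per-token Q estimator
probs = torch.softmax(
    rcfg_next_token_logits(base_logits, q_values, beta=2.0), dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
```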
Key facts
- Reward weighted classifier-free guidance (RCFG) acts as a policy improvement operator in autoregressive models
- RCFG approximates tilting the sampling distribution by the Q function
- The method enables optimization of novel reward functions at test time without retraining
- Applied successfully to molecular generation tasks
- Using RCFG as a teacher and distilling into the base policy serves as a warm start (see the distillation sketch after this list)
- Traditional reinforcement learning requires retraining when reward functions change
- Autoregressive models produce outputs summarized by attribute vectors
- Arbitrary reward functions encode tradeoffs between properties like helpfulness vs. harmlessness
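The warm-start idea in the list above can be expressed as ordinary distillation: treat the RCFG-tilted distribution as a fixed teacher and minimize a KL divergence against the trainable base policy. Below is a minimal PyTorch sketch under that assumption; distill_step and the per-token Q estimates are hypothetical, and the paper's actual training procedure may differ.

```python
import torch
import torch.nn.functional as F

def distill_step(student_logits: torch.Tensor,
                 base_logits: torch.Tensor,
                 q_values: torch.Tensor,
                 beta: float,
                 optimizer: torch.optim.Optimizer) -> float:
    """One distillation step pulling the student toward the RCFG teacher.

    All logit tensors have shape (batch, vocab_size); student_logits must
    carry gradients, while the teacher targets are detached.
    """
    with torch.no_grad():
        # Teacher: the base policy tilted by exp(beta * Q), as in RCFG.
        teacher_probs = F.softmax(base_logits + beta * q_values, dim=-1)
    # KL(teacher || student), averaged over the batch.
    loss = F.kl_div(F.log_softmax(student_logits, dim=-1),
                    teacher_probs, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Distilling the tilted distribution back into the base policy means later guidance starts from a policy already biased toward high reward, which is what makes it a warm start.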
Entities
Institutions
- arXiv