Frost Training Boosts LLM-as-a-Judge Performance
Frost Training is a new method for improving Monte Carlo-based policy optimization on a family of LLM-as-a-judge tasks called Cross-Entropy Games. The technique exploits the gradient of the reward function in embedding space, a signal previously used in the Greedy Coordinate Gradient (GCG) jailbreaking technique. This is the first time the gradient has been used to improve model training. Validation with GRPO training on maximum-likelihood infilling shows that Frost Training improves the model's ability to generate high-scoring outputs: in a best-of-k setting, it reaches higher maximum scores and reaches them faster. The research is published on arXiv.
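To make the embedding-space gradient concrete, here is a minimal toy sketch of the kind of signal GCG relies on: the gradient of a score with respect to a token's input embedding gives a first-order estimate of how much each possible token substitution would change that score. This is not the paper's implementation. The quadratic reward, the 2-dimensional embedding table, and every function name below are assumptions chosen for illustration only.

```python
# Toy illustration of a GCG-style embedding-space gradient (not the paper's code).
# Assumption: a hypothetical reward equal to the negative squared distance
# between one token's embedding and a fixed target embedding.

def reward(embedding, target):
    """Toy differentiable reward: higher when the embedding is closer to the target."""
    return -sum((e - t) ** 2 for e, t in zip(embedding, target))


def reward_gradient(embedding, target):
    """Analytic gradient of the toy reward with respect to the embedding."""
    return [-2.0 * (e - t) for e, t in zip(embedding, target)]


def rank_substitutions(embedding_table, current_token, target):
    """Rank vocabulary tokens by the first-order estimate of the reward change.

    Swapping the current token for token v moves the input embedding by
    (E[v] - E[current]); the linear estimate of the reward change is the dot
    product of that difference with the gradient at the current embedding.
    """
    current = embedding_table[current_token]
    grad = reward_gradient(current, target)
    estimates = []
    for token, candidate in enumerate(embedding_table):
        delta = sum(g * (c - e) for g, c, e in zip(grad, candidate, current))
        estimates.append((delta, token))
    # Highest estimated reward gain first; ties broken by lower token id.
    estimates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [token for _, token in estimates]


# Hypothetical 4-token vocabulary with 2-dimensional embeddings.
EMBEDDINGS = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
TARGET = [1.0, 0.0]
ranking = rank_substitutions(EMBEDDINGS, current_token=2, target=TARGET)
```

In this toy setup the top-ranked substitution is also the one that truly maximizes the reward, but the estimate is only a linear approximation. In GCG the ranked candidates are re-scored with real forward passes before one is accepted.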
Key facts
- Frost Training is a method for improving Monte Carlo-based policy optimization.
- It targets a family of LLM-as-a-judge tasks called Cross-Entropy Games.
- The method exploits the gradient of the reward function in embedding space.
- This gradient was previously used in the GCG jailbreaking technique.
- The work is the first demonstration of using this gradient for model training.
- Validation used GRPO training for maximum-likelihood infilling.
- Frost Training yields higher maximum scores in best-of-k settings.
- The method also reaches high-scoring outputs faster.
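The two evaluation ingredients named above, GRPO and best-of-k scoring, can be sketched in a few lines. GRPO scores each sampled output relative to the other samples drawn for the same prompt, and best-of-k reports the highest judge score among k samples. This is a minimal sketch that uses the commonly stated mean-and-standard-deviation normalization. The population standard deviation, the `eps` constant, and the function names are assumptions, not details taken from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each reward within its sample group.

    Assumption: population standard deviation plus a small eps so that a
    group with identical rewards yields zero advantages instead of dividing
    by zero.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def best_of_k(scores, k):
    """Maximum judge score among the first k sampled outputs."""
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of scores")
    return max(scores[:k])


# Hypothetical judge scores for four samples of the same prompt.
judge_scores = [1.0, 2.0, 3.0, 4.0]
advantages = group_relative_advantages(judge_scores)
top_of_two = best_of_k(judge_scores, 2)
```

Samples that score above the group mean receive positive advantages and are reinforced, while below-average samples are discouraged. A method that raises the best-of-k maximum therefore shifts which samples end up on the positive side.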
Entities
Institutions
- arXiv