Frost Training Boosts LLM-as-a-Judge Performance

ai-technology · 2026-05-28

A new method called Frost Training improves Monte Carlo-based policy optimization for a family of LLM-as-a-judge tasks known as Cross-Entropy Games. The technique exploits the gradient of the reward function in embedding space, a signal previously used by the Greedy Coordinate Gradient (GCG) jailbreaking technique. The work is presented as the first demonstration that this gradient can be used to improve model training. In validation with GRPO training on maximum-likelihood infilling, Frost Training improved the model's ability to generate high-scoring outputs, reaching higher maximum scores in a best-of-k setting and reaching them faster. The research is published on arXiv.
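
To make the "gradient of the reward in embedding space" idea concrete, here is a minimal toy sketch in the spirit of GCG's candidate selection. It is not the Frost Training algorithm, whose details are not given here. The embedding table `EMBED`, the point `TARGET`, and the quadratic `reward` are all hypothetical stand-ins for a real model's embeddings and a judge's score.

```python
# Toy illustration only. EMBED, TARGET and reward() are hypothetical
# stand-ins, not part of Frost Training or of any real model.
EMBED = {
    "cat": [1.0, 0.0],
    "dog": [0.0, 1.0],
    "fox": [0.8, 0.6],
}
TARGET = [1.0, 1.0]  # the toy reward is highest at this embedding


def reward(e):
    """Differentiable toy reward: negative squared distance to TARGET."""
    return -sum((x - t) ** 2 for x, t in zip(e, TARGET))


def reward_grad(e):
    """Analytic gradient of reward() with respect to the embedding e."""
    return [-2.0 * (x - t) for x, t in zip(e, TARGET)]


def rank_substitutions(current):
    """Rank replacement tokens by the first-order estimate g . (E[v] - e)."""
    e = EMBED[current]
    g = reward_grad(e)
    scores = {
        tok: sum(gi * (vi - ei) for gi, vi, ei in zip(g, vec, e))
        for tok, vec in EMBED.items()
        if tok != current
    }
    return sorted(scores, key=scores.get, reverse=True)


def best_swap(current, k):
    """Shortlist k tokens by gradient, then pick the best by exact reward."""
    shortlist = rank_substitutions(current)[:k]
    return max(shortlist, key=lambda tok: reward(EMBED[tok]))
```

In this toy, the gradient ranks "dog" above "fox" as a replacement for "cat", while the exact reward prefers "fox". The linear estimate is only a guide, which is why GCG-style methods re-score a shortlist of candidates exactly rather than trusting the gradient alone.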

Key facts

Frost Training is a method for improving Monte Carlo-based policy optimization.
It targets a family of LLM-as-a-judge tasks called Cross-Entropy Games.
The method exploits the gradient of the reward function in embedding space.
This gradient was previously used in the GCG jailbreaking technique.
Frost Training is the first demonstration of using this gradient for model training.
Validation used GRPO training for maximum-likelihood infilling.
Frost Training yields higher maximum scores in best-of-k settings.
It also reaches high-scoring outputs faster.
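
For readers unfamiliar with the evaluation terms above, the following sketch shows the two quantities involved: the group-relative advantage that GRPO (Group Relative Policy Optimization) computes from a group of sampled rewards, and the best-of-k score. The function names are illustrative, and the use of the population standard deviation with a small `eps` is an assumption; implementations differ on this detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (GRPO-style).

    Assumption: population standard deviation plus a small eps to avoid
    division by zero when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def best_of_k(rewards, k):
    """Best-of-k score: the maximum reward among the first k samples."""
    return max(rewards[:k])


# Example: three sampled completions scored by a judge (made-up numbers).
advantages = group_relative_advantages([1.0, 2.0, 3.0])
top_score = best_of_k([0.1, 0.7, 0.4], k=2)
```

A higher best-of-k score at a fixed k means that at least one of the k sampled outputs scores higher, which is the sense in which the summary reports improved maximum scores.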
