ARTFEED — Contemporary Art Intelligence

Frost Training Boosts LLM-as-a-Judge Performance

ai-technology · 2026-05-28

Frost Training is a new method that improves Monte Carlo-based policy optimization for Cross-Entropy Games, a family of LLM-as-a-judge tasks. It exploits the gradient of the reward function in embedding space, a signal previously used by the Greedy Coordinate Gradient (GCG) jailbreaking technique. This is reported as the first use of that gradient to improve model training. In validation with GRPO training on maximum-likelihood infilling, Frost Training made the model better at generating high-scoring outputs: in a best-of-k setting it reached higher maximum scores, and it reached them faster. The research is published on arXiv.
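To make "the gradient of the reward function in embedding space" concrete, here is a minimal sketch of how a GCG-style search can use that signal to rank candidate token swaps. It does not show how Frost Training itself uses the gradient, which this summary does not describe. The toy vocabulary, the quadratic `reward`, and the names `reward_grad` and `rank_swaps` are all assumptions made for illustration.

```python
# Toy illustration only: a hypothetical 4-token vocabulary with 2-d embeddings
# and a made-up differentiable reward. Not the paper's reward or method.

def reward(e, target):
    """Hypothetical reward: negative squared distance to a target embedding."""
    return -sum((a - b) ** 2 for a, b in zip(e, target))

def reward_grad(e, target):
    """Analytic gradient of the toy reward with respect to the embedding e."""
    return [-2.0 * (a - b) for a, b in zip(e, target)]

def rank_swaps(embeddings, current, target):
    """Rank replacement tokens by the first-order estimate of reward change.

    Swapping the current token for token v moves the embedding by
    (E[v] - e), so the linearized gain is (E[v] - e) . grad.
    """
    e = embeddings[current]
    g = reward_grad(e, target)
    scores = {
        tok: sum((v_i - e_i) * g_i for v_i, e_i, g_i in zip(vec, e, g))
        for tok, vec in embeddings.items()
        if tok != current
    }
    return sorted(scores, key=scores.get, reverse=True)

embeddings = {
    "a": [0.0, 0.0],
    "b": [1.0, 0.0],
    "c": [0.0, 1.0],
    "d": [1.0, 1.0],
}
target = [0.9, 0.8]

# Candidate replacements for token "a", best linearized gain first.
ranking = rank_swaps(embeddings, "a", target)
print(ranking)
```

In this toy case the linearized ranking agrees with the true reward ordering of the candidates. With a real judge model the estimate is only a first-order approximation, so the top candidates would normally be re-scored with the actual reward.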

Key facts

  • Frost Training is a method for improving Monte Carlo-based policy optimization.
  • It targets a family of LLM-as-a-judge tasks called Cross-Entropy Games.
  • The method exploits the gradient of the reward function in embedding space.
  • This gradient was previously used in the GCG jailbreaking technique.
  • This is the first demonstration of using that gradient to improve model training.
  • Validation used GRPO training for maximum-likelihood infilling.
  • Frost Training yields higher maximum scores in best-of-k settings.
  • The method also reaches high-scoring outputs faster.
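The two evaluation terms in the list above can be sketched in a few lines. The example below assumes the common GRPO formulation, in which each sampled output's reward is normalized against the mean and standard deviation of its group, and treats best-of-k as the maximum score among k samples. The reward values are invented, and the function names are hypothetical, not taken from the paper.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (GRPO-style advantages).

    Assumes the population standard deviation; implementations differ on
    this detail, and eps guards against a zero-variance group.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def best_of_k(rewards, k):
    """Maximum judge score among the first k sampled outputs."""
    return max(rewards[:k])

# Made-up judge scores for four sampled outputs of one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]

advantages = group_relative_advantages(rewards)
print(advantages)             # positive for above-average samples
print(best_of_k(rewards, 2))  # best score after 2 samples
print(best_of_k(rewards, 4))  # best score after 4 samples
```

Under this reading, "higher maximum scores in best-of-k settings" means the `best_of_k` value is larger for a given k, and "faster" means a given score is reached with fewer samples or less training.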

Entities

Institutions

  • arXiv