Logit Averaging Enhances LLM Post-Training Without KL Regularization

ai-technology · 2026-05-22

Researchers have proposed a new approach to post-training large language models. The technique averages the logits of a frozen reference policy, typically a Supervised Fine-Tuning (SFT) model, with those of a trainable policy, and trains the combination with Group Relative Policy Optimization (GRPO). In contrast to conventional Reinforcement Learning with Verifiable Rewards (RLVR) setups, it uses neither Kullback-Leibler (KL) regularization nor a critic network. Averaging the logits ties the trainable policy to the reference, so the combined model can draw on the reasoning ability the trainable policy acquires while keeping the formatting behavior of the SFT model. On the MATH, cn-k12, and MMLU benchmarks, the method reaches accuracy comparable to or higher than that of KL-regularized GRPO.
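The logit-averaging step can be illustrated with a minimal Python sketch. The article does not give the exact formula, so this assumes an equal-weight average over next-token logits. The function names, the `alpha` parameter, and the toy three-token vocabulary are hypothetical and not taken from the researchers' code.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def averaged_logits(ref_logits, policy_logits, alpha=0.5):
    """Mix frozen-reference and trainable-policy logits per token.

    alpha is a hypothetical mixing weight (an assumption, not from the
    article); alpha=0.5 gives a plain average of the two logit vectors.
    """
    if len(ref_logits) != len(policy_logits):
        raise ValueError("both policies must score the same vocabulary")
    return [alpha * r + (1.0 - alpha) * p
            for r, p in zip(ref_logits, policy_logits)]


# Toy next-token logits over a three-token vocabulary (made-up values).
ref_logits = [2.0, 0.0, -2.0]     # frozen reference (e.g. SFT) policy
policy_logits = [0.0, 2.0, -2.0]  # trainable policy

mixed = averaged_logits(ref_logits, policy_logits)
probs = softmax(mixed)  # distribution the combined model would sample from
```

With equal weights, the softmax of the averaged logits is proportional to the geometric mean of the two policies' token probabilities. A token that the reference scores very low therefore stays unlikely in the mixture, which is one way to see how averaging can keep the trainable policy close to the reference without an explicit KL term.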

Key facts

Method averages logits of frozen reference policy and trainable policy.
Incorporated into Group Relative Policy Optimization (GRPO).
No KL regularization or critic network used.
Evaluated on MATH, cn-k12, and MMLU benchmarks.
Achieves accuracy comparable to or higher than KL-regularized GRPO.
Reference policy is typically an SFT model.
Method leverages reasoning expertise while maintaining SFT formatting.
Contrasts with Reinforcement Learning with Verifiable Rewards (RLVR).
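
The "no critic network" point follows from how GRPO scores completions: each sampled answer is compared with the other answers drawn for the same prompt, not with a learned value estimate. The sketch below shows the commonly described group-relative normalization. It is an illustration under assumptions, not the researchers' implementation: the function name, the `eps` constant, the use of the population standard deviation, and the reward values are all hypothetical.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled completions.

    Each completion's advantage is its reward minus the group mean,
    divided by the group standard deviation. No critic is involved.
    eps (an assumed constant) avoids division by zero when all rewards match.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Verifiable rewards for four sampled answers to one prompt
# (1.0 = correct, 0.0 = incorrect; made-up values).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct answers receive positive advantages and incorrect ones negative, and the advantages sum to roughly zero within the group. In the setup the article describes, these advantages would weight the policy-gradient update with no added KL penalty.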
