Preference-Based Self-Distillation Improves On-Policy Training
Preference-Based Self-Distillation (PBSD) is a newly proposed method that addresses limitations of on-policy self-distillation for language models. Existing approaches reduce learning to KL matching toward a context-augmented teacher, which can destabilize training and degrade reasoning performance over time; because the teacher is the same model with an augmented prompt, these approaches also lack exploratory diversity. PBSD moves beyond fixed-teacher KL matching by revisiting on-policy self-distillation from a reward-regularized perspective. The method is introduced in a paper on arXiv (2605.05040).
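To make the critiqued baseline concrete, here is a minimal PyTorch sketch of fixed-teacher KL matching as the summary describes it: the same model, conditioned on a context-augmented prompt, serves as a frozen teacher for the plain-prompt student. This is an illustration under stated assumptions, not the paper's implementation; `model` is assumed to be a Hugging Face-style causal LM returning `.logits`, and all argument names are hypothetical.

```python
import torch
import torch.nn.functional as F

def self_distill_kl_loss(model, prompt_ids, aug_prompt_ids, response_ids):
    """Token-level KL from a frozen 'teacher' (the same model on the
    context-augmented prompt) toward the student (the plain prompt).
    All tensors are assumed to be batched LongTensors of token ids."""
    resp_len = response_ids.size(-1)

    # Teacher pass: same weights, augmented prompt, no gradient.
    with torch.no_grad():
        t_logits = model(torch.cat([aug_prompt_ids, response_ids], dim=-1)).logits
        # Logits at position i predict token i + 1, so the response tokens
        # are predicted by the last resp_len positions before the end.
        t_logits = t_logits[:, -resp_len - 1:-1, :]

    # Student pass: plain prompt, gradients flow.
    s_logits = model(torch.cat([prompt_ids, response_ids], dim=-1)).logits
    s_logits = s_logits[:, -resp_len - 1:-1, :]

    # KL(teacher || student) over the response tokens, batch-averaged.
    t_logp = F.log_softmax(t_logits, dim=-1)
    s_logp = F.log_softmax(s_logits, dim=-1)
    return F.kl_div(s_logp, t_logp, log_target=True, reduction="batchmean")
```

Because the teacher distribution is fixed by the prompt augmentation, minimizing this loss can only pull the student toward one target, which is the rigidity PBSD is said to move beyond.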
Key facts
- Preference-Based Self-Distillation (PBSD) is proposed.
- PBSD addresses limitations of existing self-distillation methods.
- Existing methods reduce learning to KL matching toward a context-augmented teacher.
- KL matching can cause training instability and degrade reasoning.
- Self-distillation from same model lacks exploratory diversity.
- PBSD recasts self-distillation from a reward-regularized perspective (see the sketch after this list).
- Paper available on arXiv with ID 2605.05040.
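The summary does not spell out PBSD's objective, so the sketch below shows only one plausible instantiation of a reward-regularized, preference-based view: rank two on-policy samples by some reward (for instance, likelihood under the context-augmented teacher) and apply a DPO-style pairwise logistic loss with the frozen model as reference. Every name here is an assumption for illustration, not the paper's method.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(policy_logp_w, policy_logp_l,
                             ref_logp_w, ref_logp_l, beta=0.1):
    """DPO-style logistic loss over a (winner, loser) response pair.
    Each argument is the summed log-probability of a full sampled
    response; `_w`/`_l` denote the preferred and dispreferred sample,
    and `beta` scales the implicit KL regularization to the reference."""
    margin = beta * ((policy_logp_w - ref_logp_w) -
                     (policy_logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()
```

Under this reading, the reference log-probability terms act as the KL regularizer, while the preference between on-policy samples supplies the exploratory signal that fixed-teacher KL matching lacks.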
Entities
Institutions
- arXiv