Preference-Based Self-Distillation Improves On-Policy Training
Preference-Based Self-Distillation (PBSD) is a newly proposed method that addresses limitations of on-policy self-distillation for language models. Existing approaches reduce learning to KL matching toward a context-augmented teacher, which can destabilize training and degrade reasoning performance over time; because the teacher is the same model with an augmented prompt, these approaches also lack exploratory diversity. PBSD moves beyond fixed-teacher KL matching by revisiting on-policy self-distillation from a reward-regularized perspective. The method is introduced in a paper on arXiv (2605.05040).
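To make the critiqued baseline concrete, here is a minimal PyTorch sketch of fixed-teacher KL matching as the summary describes it: the same model, conditioned on a context-augmented prompt, serves as a frozen teacher for the plain-prompt student. This is an illustration under stated assumptions, not the paper's implementation; `model` is assumed to be a Hugging Face-style causal LM returning `.logits`, and all argument names are hypothetical.

```python
import torch
import torch.nn.functional as F

def self_distill_kl_loss(model, prompt_ids, aug_prompt_ids, response_ids):
    """Token-level KL from a frozen 'teacher' (the same model on the
    context-augmented prompt) toward the student (the plain prompt).
    All tensors are assumed to be batched LongTensors of token ids."""
    resp_len = response_ids.size(-1)

    # Teacher pass: same weights, augmented prompt, no gradient.
    with torch.no_grad():
        t_logits = model(torch.cat([aug_prompt_ids, response_ids], dim=-1)).logits
        # Logits at position i predict token i + 1, so the response tokens
        # are predicted by the last resp_len positions before the end.
        t_logits = t_logits[:, -resp_len - 1:-1, :]

    # Student pass: plain prompt, gradients flow.
    s_logits = model(torch.cat([prompt_ids, response_ids], dim=-1)).logits
    s_logits = s_logits[:, -resp_len - 1:-1, :]

    # KL(teacher || student) over the response tokens, batch-averaged.
    t_logp = F.log_softmax(t_logits, dim=-1)
    s_logp = F.log_softmax(s_logits, dim=-1)
    return F.kl_div(s_logp, t_logp, log_target=True, reduction="batchmean")
```

Because the teacher distribution is fixed by the prompt augmentation, minimizing this loss can only pull the student toward one target, which is the rigidity PBSD is said to move beyond.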
Key facts
- Preference-Based Self-Distillation (PBSD) is proposed.
- PBSD addresses limitations of existing self-distillation methods.
- Existing methods reduce learning to KL matching toward a context-augmented teacher.
- KL matching can cause training instability and degrade reasoning.
- Self-distillation from same model lacks exploratory diversity.
- PBSD recasts self-distillation from a reward-regularized perspective (see the sketch after this list).
- Paper available on arXiv with ID 2605.05040.
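The summary does not spell out PBSD's objective, so the sketch below shows only one plausible instantiation of a reward-regularized, preference-based view: rank two on-policy samples by some reward (for instance, likelihood under the context-augmented teacher) and apply a DPO-style pairwise logistic loss with the frozen model as reference. Every name here is an assumption for illustration, not the paper's method.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(policy_logp_w, policy_logp_l,
                             ref_logp_w, ref_logp_l, beta=0.1):
    """DPO-style logistic loss over a (winner, loser) response pair.
    Each argument is the summed log-probability of a full sampled
    response; `_w`/`_l` denote the preferred and dispreferred sample,
    and `beta` scales the implicit KL regularization to the reference."""
    margin = beta * ((policy_logp_w - ref_logp_w) -
                     (policy_logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()
```

Under this reading, the reference log-probability terms act as the KL regularizer, while the preference between on-policy samples supplies the exploratory signal that fixed-teacher KL matching lacks.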
Entities
Institutions
- arXiv