Preference-Based Self-Distillation Improves On-Policy Training

other · 2026-05-07

Preference-Based Self-Distillation (PBSD) is a newly proposed method that addresses limitations of on-policy self-distillation for language models. Existing approaches reduce learning to KL matching toward a context-augmented teacher, that is, the same model conditioned on an augmented prompt, which can destabilize training and degrade reasoning performance over time; because student and teacher are the same model, the procedure also lacks exploratory diversity. PBSD moves beyond fixed-teacher KL matching by revisiting on-policy self-distillation through a reward-regularized perspective. The method is introduced in an arXiv paper (arXiv:2605.05040).
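
As a rough illustration (not drawn from the paper itself), the reward-regularized reading treats KL self-distillation as the special case of a KL-regularized reward objective with the reward switched off; the symbols below are illustrative and the paper's exact formulation may differ:

    J(\pi_\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\left[ r(x, y) \right] - \beta \, \mathrm{KL}\!\left( \pi_\theta(\cdot \mid x) \,\|\, \pi_T(\cdot \mid x, c) \right)

Here \pi_T is the same model conditioned on an augmented context c, \beta sets the strength of the distillation term, and setting r \equiv 0 recovers pure KL matching toward the context-augmented teacher.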

Key facts

  • Preference-Based Self-Distillation (PBSD) is proposed.
  • PBSD addresses limitations of existing self-distillation methods.
  • Existing methods reduce learning to KL matching toward a context-augmented teacher.
  • KL matching can cause training instability and degrade reasoning over time.
  • Self-distillation from the same model lacks exploratory diversity.
  • PBSD instead takes a reward-regularized perspective (sketched in code after this list).
  • Paper available as arXiv:2605.05040.
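
For a concrete picture of the baseline described above, here is a minimal PyTorch sketch, assuming a reverse-KL loss from the student toward the same model run on an augmented prompt, plus a DPO-style preference term standing in for the "preference-based" ingredient. All names (self_distill_kl, preference_term, beta) are illustrative assumptions, not the paper's API:

    import torch
    import torch.nn.functional as F

    def self_distill_kl(student_logits, teacher_logits):
        # Reverse KL from the student to a context-augmented "teacher",
        # i.e., the same model on an augmented prompt. The teacher is
        # detached so gradients flow only through the student.
        log_p_s = F.log_softmax(student_logits, dim=-1)
        log_p_t = F.log_softmax(teacher_logits.detach(), dim=-1)
        # KL(student || teacher) per token, averaged over batch and time.
        return (log_p_s.exp() * (log_p_s - log_p_t)).sum(-1).mean()

    def preference_term(logp_chosen, logp_rejected, beta=0.1):
        # Hypothetical DPO-style loss over two on-policy samples ranked
        # by some reward signal; illustrates preference-based
        # regularization, not PBSD's actual objective.
        return -F.logsigmoid(beta * (logp_chosen - logp_rejected)).mean()

    # Toy shapes: batch of 2 sequences, 5 tokens, vocabulary of 11.
    student_logits = torch.randn(2, 5, 11, requires_grad=True)
    teacher_logits = torch.randn(2, 5, 11)  # same model, augmented prompt

    loss = self_distill_kl(student_logits, teacher_logits)
    loss = loss + preference_term(torch.randn(2), torch.randn(2))
    loss.backward()  # gradients reach only the student

A preference term of this kind is one way to reintroduce the exploratory diversity that pure KL matching toward a fixed teacher lacks, which is exactly the gap the summary flags.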

Entities

Institutions

  • arXiv

Sources

  • arXiv:2605.05040 (https://arxiv.org/abs/2605.05040)