SC-SDPO: Scale-Consistent Self-Distillation Improves LLM Reasoning

other · 2026-05-28

SC-SDPO (Scale-Consistent Self-Distillation Policy Optimization) is a new method that improves reasoning in large language models by addressing a limitation of SDPO. SDPO uses the model's own feedback-conditioned predictions as a self-teacher, which gives dense token-level credit assignment. Unlike GRPO, whose group-relative advantage naturally focuses on intermediate-difficulty questions, SDPO's KL-based advantage has no implicit difficulty awareness. By analyzing GRPO's advantage normalization, the researchers found that normalization absorbs the variance term p(1-p), which equalizes learnability across questions and leaves a residual scaling factor of sqrt(p(1-p)) in the per-question gradient. SC-SDPO applies the same scaling to SDPO by weighting each question's SDPO loss by [p̂(1-p̂)]^{1/2}, where p̂ is the estimated pass rate. This scale-consistent variant improves performance on reasoning tasks. The paper is available on arXiv under identifier 2605.27765.
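
As a worked illustration of where a factor of sqrt(p(1-p)) can come from, the sketch below assumes binary rewards r in {0, 1} with pass rate p, and replaces the group mean and standard deviation with their population values p and sqrt(p(1-p)). These are assumptions made for this illustration; the paper's own derivation may differ.

```latex
% Assumption: r ~ Bernoulli(p); group statistics replaced by population values.
\begin{align*}
A &= \frac{r - p}{\sqrt{p(1-p)}}, \\
\mathbb{E}\,|r - p| &= p\,(1-p) + (1-p)\,p = 2p(1-p), \\
\mathbb{E}\,|A| &= \frac{2p(1-p)}{\sqrt{p(1-p)}} = 2\sqrt{p(1-p)}.
\end{align*}
```

Under these assumptions the expected advantage magnitude peaks at p = 1/2 and vanishes at p = 0 and p = 1, which is consistent with GRPO focusing on intermediate-difficulty questions.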

Key facts

SC-SDPO is a variant of Self-Distillation Policy Optimization (SDPO).
SDPO uses the model's own feedback-conditioned predictions as a self-teacher.
GRPO's group-relative advantage naturally focuses on intermediate-difficulty questions.
SDPO's KL-based advantage lacks implicit difficulty awareness.
GRPO's advantage normalization absorbs the variance term p(1-p), equalizing learnability across questions.
This leaves a residual scaling factor of sqrt(p(1-p)) in the per-question gradient.
SC-SDPO weights each question's SDPO loss by [p̂(1-p̂)]^{1/2}, where p̂ is the estimated pass rate.
The paper is available on arXiv with ID 2605.27765.
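
The weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, rewards are assumed to be binary (1 for a correct rollout, 0 otherwise), p̂ is assumed to be the fraction of correct rollouts sampled for a question, and averaging the weighted losses over the batch is an assumption about how the per-question terms are combined.

```python
import math
from typing import Sequence


def estimate_pass_rate(rewards: Sequence[float]) -> float:
    """Estimate p-hat as the fraction of correct rollouts (binary rewards assumed)."""
    if not rewards:
        raise ValueError("need at least one rollout per question")
    return sum(rewards) / len(rewards)


def sc_weight(p_hat: float) -> float:
    """Scale-consistent weight [p_hat * (1 - p_hat)]^(1/2)."""
    return math.sqrt(p_hat * (1.0 - p_hat))


def weighted_batch_loss(
    per_question_losses: Sequence[float],
    per_question_rewards: Sequence[Sequence[float]],
) -> float:
    """Average of per-question losses, each scaled by its weight.

    Assumption: the per-question SDPO losses are already computed elsewhere,
    and the batch reduction is a plain mean over questions.
    """
    if len(per_question_losses) != len(per_question_rewards):
        raise ValueError("one reward group is required per question")
    weights = [sc_weight(estimate_pass_rate(r)) for r in per_question_rewards]
    total = sum(w * loss for w, loss in zip(weights, per_question_losses))
    return total / len(per_question_losses)


if __name__ == "__main__":
    # Two questions: one at 50% pass rate, one always solved.
    losses = [2.0, 2.0]
    rewards = [[1, 0, 1, 0], [1, 1, 1, 1]]
    print(weighted_batch_loss(losses, rewards))
```

In this sketch a question that is always solved or never solved gets weight 0, and a question with a 50% pass rate gets the maximum weight of 0.5, so the loss concentrates on intermediate-difficulty questions.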
