SC-SDPO: Scale-Consistent Self-Distillation Improves LLM Reasoning
SC-SDPO (Scale-Consistent Self-Distillation Policy Optimization) is a method for improving reasoning in large language models that addresses a limitation of SDPO (Self-Distillation Policy Optimization). SDPO uses the model's own feedback-conditioned predictions as a teacher, which gives dense token-level credit assignment, but its KL-based advantage has no implicit difficulty awareness. GRPO, by contrast, naturally focuses on intermediate-difficulty questions through its group-relative advantage. By analyzing GRPO's advantage normalization, the authors find that normalization absorbs the variance term p(1-p), equalizing learnability across questions and leaving a residual scaling factor of sqrt(p(1-p)) in the per-question gradient. SC-SDPO carries this factor over to SDPO by weighting each question's SDPO loss by [p̂(1-p̂)]^{1/2}, where p̂ is the estimated pass rate. According to the paper, this scale-consistent variant improves performance on reasoning tasks. The work is published on arXiv under identifier 2605.27765.
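As a worked illustration of where a sqrt(p(1-p)) factor can arise, consider a binary-reward setting in the large-group limit. Both are assumptions made here for illustration; this summary does not reproduce the paper's own derivation, which may differ. If a question has pass rate p and each sampled response receives reward r in {0, 1}, the group mean is p and the group standard deviation is sqrt(p(1-p)), so the normalized advantage A and its expected magnitude are:

```latex
\[
A = \frac{r - p}{\sqrt{p(1-p)}}, \qquad
A\big|_{r=1} = \sqrt{\frac{1-p}{p}}, \qquad
A\big|_{r=0} = -\sqrt{\frac{p}{1-p}}
\]
\[
\mathbb{E}\,|A| = p\,\sqrt{\frac{1-p}{p}} + (1-p)\,\sqrt{\frac{p}{1-p}} = 2\sqrt{p(1-p)}
\]
```

Under these assumptions the expected advantage magnitude peaks at p = 1/2 and vanishes as p approaches 0 or 1, which is consistent with the description of GRPO concentrating on intermediate-difficulty questions.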
Key facts
- SC-SDPO is a variant of Self-Distillation Policy Optimization (SDPO).
- SDPO uses the model's own feedback-conditioned predictions as a self-teacher.
- GRPO's group-relative advantage naturally focuses on intermediate-difficulty questions.
- SDPO's KL-based advantage lacks implicit difficulty awareness.
- GRPO's advantage normalization absorbs the variance term p(1-p), equalizing learnability across questions.
- The residual scaling factor in the per-question gradient is sqrt(p(1-p)).
- SC-SDPO weights each question's SDPO loss by [p̂(1-p̂)]^{1/2}.
- The paper is available on arXiv with ID 2605.27765.
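The weighting described in the facts above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and estimating p̂ as the fraction of sampled responses that pass is an assumption, since this summary does not say how the pass rate is estimated or how the per-question SDPO loss is computed.

```python
import math


def sc_weight(pass_rate_estimate):
    """Return [p_hat * (1 - p_hat)] ** 0.5 for an estimated pass rate in [0, 1]."""
    if not 0.0 <= pass_rate_estimate <= 1.0:
        raise ValueError("pass rate must lie in [0, 1]")
    return math.sqrt(pass_rate_estimate * (1.0 - pass_rate_estimate))


def estimate_pass_rate(outcomes):
    """Assumed estimator: fraction of sampled responses marked as passing."""
    if not outcomes:
        raise ValueError("need at least one sampled outcome")
    return sum(1 for passed in outcomes if passed) / len(outcomes)


def weight_question_losses(per_question_losses, per_question_outcomes):
    """Scale each question's (already computed) loss by its sqrt(p_hat(1-p_hat)) weight."""
    return [
        sc_weight(estimate_pass_rate(outcomes)) * loss
        for loss, outcomes in zip(per_question_losses, per_question_outcomes)
    ]


# Example: two questions with the same raw loss but different pass rates.
losses = [2.0, 2.0]
outcomes = [[True, False], [True, True]]
weighted = weight_question_losses(losses, outcomes)
```

With this weighting, a question with p̂ = 0.5 receives the maximum weight of 0.5, while questions that are always or never solved (p̂ = 1 or p̂ = 0) receive weight 0 and contribute nothing to the weighted loss.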
Entities
Institutions
- arXiv