ARTFEED — Contemporary Art Intelligence

AntiSD: New RL Method Reverses Self-Distillation for Math Reasoning

ai-technology · 2026-05-13

A new reinforcement learning method called Anti-Self-Distillation (AntiSD) is proposed to improve reasoning in large language models, specifically on math tasks. The approach addresses failures in on-policy self-distillation, where a student model learns from a copy of itself conditioned on privileged context such as verified solutions. Using pointwise mutual information analysis, the researchers found that privileged context inflates teacher confidence on structural tokens (e.g., connectives, verifiable claims) and deflates it on deliberation tokens (e.g., 'Wait', 'Let', 'Maybe') that are crucial for multi-step search. AntiSD reverses the direction of the divergence term, ascending rather than descending the student-teacher divergence, which yields a naturally bounded per-token advantage. An entropy-triggered gate disables the reversed term when teacher entropy collapses. The method is detailed in arXiv paper 2605.11609; the authors' institution is not specified.
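For intuition, the PMI diagnostic described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's code: the function name and the toy values are hypothetical, and it assumes the sampled tokens' log-probabilities have already been gathered from one scoring pass with the privileged context (teacher) and one without (student).

    import torch

    def token_pmi(teacher_logp: torch.Tensor, student_logp: torch.Tensor) -> torch.Tensor:
        """Pointwise mutual information between the privileged context and
        each sampled token:

            PMI(t) = log p(t | prompt, privileged) - log p(t | prompt)

        Positive values: the privileged context inflates confidence in t.
        Negative values: it deflates confidence in t.
        Both inputs are log-probs of the sampled tokens, shape [seq_len].
        """
        return teacher_logp - student_logp

    # Hypothetical toy run: deliberation tokens come out with negative PMI,
    # structural tokens with positive PMI, mirroring the reported finding.
    tokens = ["Let", "x", "=", "2", ".", "Therefore", "Wait"]
    teacher_logp = torch.tensor([-3.1, -0.2, -0.1, -0.4, -0.3, -0.2, -4.0])
    student_logp = torch.tensor([-1.2, -0.3, -0.1, -0.5, -0.3, -0.9, -1.1])

    for tok, score in zip(tokens, token_pmi(teacher_logp, student_logp).tolist()):
        print(f"{tok:>9}  PMI={score:+.2f}")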

Key facts

  • AntiSD stands for Anti-Self-Distillation
  • Method targets reasoning reinforcement learning for math
  • On-policy self-distillation uses a copy of the student as teacher
  • Privileged context includes verified solutions or feedback
  • Pointwise mutual information analysis identified token-level issues
  • Structural tokens (connectives, verifiable claims) get inflated confidence
  • Deliberation tokens ('Wait', 'Let', 'Maybe') get deflated confidence
  • AntiSD ascends the student-teacher divergence instead of descending it (see the sketch after this list)
  • Entropy-triggered gate disables the reversed term when teacher entropy collapses
  • Paper published on arXiv with ID 2605.11609
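
The reversed-divergence and gating facts above can be made concrete with a short, hedged sketch. The bounded form of the advantage (one minus the teacher/student probability ratio), the gate threshold, and all names below are assumptions chosen to match the description in the summary; the paper's actual objective may differ.

    import torch
    import torch.nn.functional as F

    def antisd_advantage(
        student_logits: torch.Tensor,  # [seq_len, vocab], no privileged context
        teacher_logits: torch.Tensor,  # [seq_len, vocab], sees privileged context
        sampled_tokens: torch.Tensor,  # [seq_len], sampled on-policy from the student
        entropy_floor: float = 0.5,    # hypothetical gate threshold, in nats
    ) -> torch.Tensor:
        """One plausible reversed self-distillation advantage.

        Ordinary self-distillation descends the student-teacher divergence,
        rewarding tokens the privileged teacher prefers. Reversing the
        direction (ascending) rewards tokens the teacher under-weights.
        With r = p_teacher(t) / p_student(t), the advantage 1 - r is
        bounded above by 1, one way to read 'naturally bounded per token'.
        """
        student_logp = F.log_softmax(student_logits, dim=-1)
        teacher_logp = F.log_softmax(teacher_logits, dim=-1)

        idx = sampled_tokens.unsqueeze(-1)
        s = student_logp.gather(-1, idx).squeeze(-1)  # log p_student(t)
        t = teacher_logp.gather(-1, idx).squeeze(-1)  # log p_teacher(t)

        advantage = 1.0 - torch.exp(t - s)            # reversed direction, <= 1

        # Entropy-triggered gate: when the teacher distribution collapses
        # (entropy below the floor), disable the reversed term entirely.
        entropy = -(teacher_logp.exp() * teacher_logp).sum(dim=-1)  # nats
        return advantage * (entropy >= entropy_floor).float()

    # Toy shapes: a 5-token rollout over a 100-token vocabulary.
    torch.manual_seed(0)
    adv = antisd_advantage(torch.randn(5, 100), torch.randn(5, 100),
                           torch.randint(0, 100, (5,)))
    print(adv)

The gate reflects the 'prevents collapse' fact above: once the teacher distribution is near-deterministic, an ascent term would reward essentially any token the teacher suppresses, so the sketch zeroes it out there.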

Entities

Institutions

  • arXiv
