ARTFEED — Contemporary Art Intelligence

AntiSD: New RL Method Reverses Self-Distillation for Math Reasoning

ai-technology · 2026-05-13

A new reinforcement learning method called Anti-Self-Distillation (AntiSD) is proposed to improve reasoning in large language models, specifically on math tasks. The approach addresses failures in on-policy self-distillation, where a student model learns from a copy of itself conditioned on privileged context such as verified solutions. Using pointwise mutual information analysis, the researchers found that privileged context inflates teacher confidence on structural tokens (e.g., connectives, verifiable claims) and deflates it on deliberation tokens (e.g., 'Wait', 'Let', 'Maybe') that are crucial for multi-step search. AntiSD reverses the direction of the divergence term, ascending rather than descending the student-teacher divergence, which yields a naturally bounded per-token advantage. An entropy-triggered gate disables the reversed term when teacher entropy collapses. The method is detailed in arXiv paper 2605.11609; the authors' institution is not specified.
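For intuition, the PMI diagnostic described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's code: the function name and the toy values are hypothetical, and it assumes the sampled tokens' log-probabilities have already been gathered from one scoring pass with the privileged context (teacher) and one without (student).

    import torch

    def token_pmi(teacher_logp: torch.Tensor, student_logp: torch.Tensor) -> torch.Tensor:
        """Pointwise mutual information between the privileged context and
        each sampled token:

            PMI(t) = log p(t | prompt, privileged) - log p(t | prompt)

        Positive values: the privileged context inflates confidence in t.
        Negative values: it deflates confidence in t.
        Both inputs are log-probs of the sampled tokens, shape [seq_len].
        """
        return teacher_logp - student_logp

    # Hypothetical toy run: deliberation tokens come out with negative PMI,
    # structural tokens with positive PMI, mirroring the reported finding.
    tokens = ["Let", "x", "=", "2", ".", "Therefore", "Wait"]
    teacher_logp = torch.tensor([-3.1, -0.2, -0.1, -0.4, -0.3, -0.2, -4.0])
    student_logp = torch.tensor([-1.2, -0.3, -0.1, -0.5, -0.3, -0.9, -1.1])

    for tok, score in zip(tokens, token_pmi(teacher_logp, student_logp).tolist()):
        print(f"{tok:>9}  PMI={score:+.2f}")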

Key facts

  • AntiSD stands for Anti-Self-Distillation
  • Method targets reasoning reinforcement learning for math
  • On-policy self-distillation uses a copy of the student as teacher
  • Privileged context includes verified solutions or feedback
  • Pointwise mutual information analysis identified token-level issues
  • Structural tokens (connectives, verifiable claims) get inflated confidence
  • Deliberation tokens ('Wait', 'Let', 'Maybe') get deflated confidence
  • AntiSD ascends the student-teacher divergence instead of descending it (see the sketch after this list)
  • Entropy-triggered gate disables the reversed term when teacher entropy collapses
  • Paper published on arXiv with ID 2605.11609
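
The reversed-divergence and gating facts above can be made concrete with a short, hedged sketch. The bounded form of the advantage (one minus the teacher/student probability ratio), the gate threshold, and all names below are assumptions chosen to match the description in the summary; the paper's actual objective may differ.

    import torch
    import torch.nn.functional as F

    def antisd_advantage(
        student_logits: torch.Tensor,  # [seq_len, vocab], no privileged context
        teacher_logits: torch.Tensor,  # [seq_len, vocab], sees privileged context
        sampled_tokens: torch.Tensor,  # [seq_len], sampled on-policy from the student
        entropy_floor: float = 0.5,    # hypothetical gate threshold, in nats
    ) -> torch.Tensor:
        """One plausible reversed self-distillation advantage.

        Ordinary self-distillation descends the student-teacher divergence,
        rewarding tokens the privileged teacher prefers. Reversing the
        direction (ascending) rewards tokens the teacher under-weights.
        With r = p_teacher(t) / p_student(t), the advantage 1 - r is
        bounded above by 1, one way to read 'naturally bounded per token'.
        """
        student_logp = F.log_softmax(student_logits, dim=-1)
        teacher_logp = F.log_softmax(teacher_logits, dim=-1)

        idx = sampled_tokens.unsqueeze(-1)
        s = student_logp.gather(-1, idx).squeeze(-1)  # log p_student(t)
        t = teacher_logp.gather(-1, idx).squeeze(-1)  # log p_teacher(t)

        advantage = 1.0 - torch.exp(t - s)            # reversed direction, <= 1

        # Entropy-triggered gate: when the teacher distribution collapses
        # (entropy below the floor), disable the reversed term entirely.
        entropy = -(teacher_logp.exp() * teacher_logp).sum(dim=-1)  # nats
        return advantage * (entropy >= entropy_floor).float()

    # Toy shapes: a 5-token rollout over a 100-token vocabulary.
    torch.manual_seed(0)
    adv = antisd_advantage(torch.randn(5, 100), torch.randn(5, 100),
                           torch.randint(0, 100, (5,)))
    print(adv)

The gate reflects the 'prevents collapse' fact above: once the teacher distribution is near-deterministic, an ascent term would reward essentially any token the teacher suppresses, so the sketch zeroes it out there.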

Entities

Institutions

  • arXiv
