AMR-SD: New Self-Distillation Method for LLM Token-Level Credit Assignment
A new method called Asymmetric Meta-Reflective Self-Distillation (AMR-SD) has been proposed to improve token-level credit assignment in reinforcement learning for large language models. Standard algorithms such as GRPO assign one sequence-level reward uniformly to every token in a response, so training cannot distinguish the tokens that helped from those that hurt. This is the credit-assignment bottleneck. On-policy self-distillation attempts to address it, but direct exposure to raw oracle solutions leads to over-conditioned teacher distributions and training collapse. AMR-SD inserts a reflection bottleneck that compresses diagnostic signals from verifier outcomes, peer rollouts, or reference feedback into concise, self-generated Socratic hints and critiques. The method also introduces Causal Information Gain to further enhance learning. The paper is available on arXiv under identifier 2605.18529.
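To make the uniform-reward problem concrete, here is a minimal sketch of how a GRPO-style group-relative advantage ends up identical for every token of a rollout. The function name `grpo_token_advantages`, the toy rewards, and the use of the population standard deviation are assumptions for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def grpo_token_advantages(rewards, lengths, eps=1e-6):
    """Broadcast one group-relative advantage over all tokens of each rollout.

    rewards: one scalar reward per rollout in the group.
    lengths: number of generated tokens in each rollout.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    # One advantage per sequence, normalized within the group.
    seq_adv = [(r - mu) / (sigma + eps) for r in rewards]
    # Every token in a rollout receives the same value: no token-level credit.
    return [[a] * n for a, n in zip(seq_adv, lengths)]


# Toy group of four rollouts scored by a binary verifier (illustrative values).
rewards = [1.0, 0.0, 0.0, 1.0]
lengths = [3, 5, 2, 4]
token_adv = grpo_token_advantages(rewards, lengths)
```

Because each inner list holds a single repeated value, a correct final answer rewards every token equally, including any flawed intermediate steps.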
Key facts
- AMR-SD addresses the credit-assignment bottleneck in LLM reinforcement learning.
- Standard GRPO applies a sequence-level reward uniformly across all tokens.
- On-policy self-distillation suffers from over-conditioned teacher distributions and training collapse when exposed directly to raw oracle solutions.
- AMR-SD inserts a reflection bottleneck to compress diagnostic signals.
- Diagnostic signals come from verifier outcomes, peer rollouts, or reference feedback.
- Signals are compressed into self-generated Socratic hints and critiques.
- Causal Information Gain is introduced as part of the method.
- The paper is available on arXiv under identifier 2605.18529.
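As a rough illustration of what a token-level signal from self-distillation can look like, the sketch below compares the log-probability a hint-conditioned teacher pass assigns to each sampled token with the log-probability from the unconditioned student pass. This is a generic self-distillation signal under assumed inputs, not the AMR-SD objective or its Causal Information Gain, neither of which this summary specifies. The function name `token_level_signal` and all probabilities are hypothetical.

```python
import math


def token_level_signal(student_logprobs, teacher_logprobs):
    """Per-token log-ratio between a hint-conditioned teacher and the student.

    Both lists hold log-probabilities of the same sampled tokens, in order.
    A positive entry means the teacher favors that token more than the student.
    """
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("both passes must score the same token sequence")
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]


# Hypothetical probabilities for three sampled tokens (illustrative only).
student = [math.log(0.5), math.log(0.2), math.log(0.9)]
teacher = [math.log(0.5), math.log(0.8), math.log(0.3)]
signal = token_level_signal(student, teacher)
```

Unlike a sequence-level reward, this signal differs from token to token: it is zero where the two passes agree, positive for the second token, and negative for the third.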
Entities
Institutions
- arXiv