SDAR: Self-Distilled Agentic Reinforcement Learning for LLMs
SDAR (Self-Distilled Agentic Reinforcement Learning) is a new method for the post-training of large language model (LLM) agents. It addresses a core limitation of reinforcement learning (RL) on long-horizon tasks: RL supplies only coarse trajectory-level rewards. SDAR builds on On-Policy Self-Distillation (OPSD), which adds dense token-level guidance from a teacher branch given privileged context; used on its own, however, OPSD is unstable in multi-turn settings and can exert harmful negative pressure on tokens the teacher rejects. SDAR therefore keeps RL as the primary optimization objective and treats OPSD as a gated auxiliary objective: a sigmoid gate over detached token-level signals strengthens distillation on teacher-endorsed tokens and softly downweights the rest. The paper is available on arXiv under ID 2605.15155.
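To make the gated objective concrete, here is a minimal PyTorch-style sketch under stated assumptions. The summary does not specify the exact loss, so the policy-gradient form of the RL term, the forward-KL distillation term, the teacher-minus-student log-probability as the endorsement signal, and the hyperparameters `beta` (gate sharpness) and `lam` (auxiliary weight) are all illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def sdar_loss(student_logits, teacher_logits, actions, advantages,
              old_logprobs, beta=1.0, lam=0.5):
    """Sketch of an RL loss with a gated self-distillation auxiliary term.

    student_logits: (B, T, V) logits from the policy being trained.
    teacher_logits: (B, T, V) logits from a teacher branch that saw
                    privileged context (treated as a fixed target).
    actions:        (B, T) sampled token ids.
    advantages:     (B, T) coarse trajectory-level advantage, broadcast
                    to every token of the trajectory.
    old_logprobs:   (B, T) behavior-policy log-probs of `actions`.
    """
    logp = F.log_softmax(student_logits, dim=-1)
    act_logp = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)  # (B, T)

    # Primary objective: a plain importance-weighted policy-gradient loss
    # on the trajectory-level reward (PPO-style clipping omitted).
    ratio = torch.exp(act_logp - old_logprobs)
    rl_loss = -(ratio * advantages).mean()

    # Token-level endorsement signal: how much more the teacher likes the
    # sampled token than the student does. Detached, so the gate only
    # modulates the distillation term and carries no gradient itself.
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    teacher_act_logp = teacher_logp.gather(
        -1, actions.unsqueeze(-1)).squeeze(-1)
    endorsement = (teacher_act_logp - act_logp).detach()  # (B, T)

    # Sigmoid gate: near 1 on teacher-endorsed tokens (strong distillation),
    # near 0 on teacher-rejected tokens (distillation softly suppressed).
    gate = torch.sigmoid(beta * endorsement)

    # Auxiliary distillation term: per-token forward KL from the detached
    # teacher distribution to the student, weighted by the gate.
    kl = F.kl_div(logp, teacher_logp.detach(), reduction='none',
                  log_target=True).sum(-1)                # (B, T)
    distill_loss = (gate * kl).mean()

    return rl_loss + lam * distill_loss
```

Detaching the endorsement signal means the gate only scales per-token distillation strength; gradients flow through the RL and KL terms alone, which keeps RL the primary driver of the update, as the method intends.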
Key facts
- SDAR stands for Self-Distilled Agentic Reinforcement Learning.
- It is designed for post-training LLM agents.
- RL provides only coarse trajectory-level rewards.
- OPSD adds dense token-level guidance from a teacher branch.
- OPSD suffers from multi-turn instability and harmful negative pressure on teacher-rejected tokens.
- SDAR uses a gated auxiliary objective with RL as primary.
- A sigmoid gate strengthens distillation on teacher-endorsed tokens.
- Paper available on arXiv: 2605.15155.