SDAR: Self-Distilled Agentic Reinforcement Learning for LLMs
SDAR (Self-Distilled Agentic Reinforcement Learning) is a new method for the post-training of large language model (LLM) agents. It addresses a core limitation of reinforcement learning (RL) on long-horizon tasks: RL supplies only coarse trajectory-level rewards. SDAR builds on On-Policy Self-Distillation (OPSD), which adds dense token-level guidance from a teacher branch given privileged context; used on its own, however, OPSD is unstable in multi-turn settings and can exert harmful negative pressure on tokens the teacher rejects. SDAR therefore keeps RL as the primary optimization objective and treats OPSD as a gated auxiliary objective: a sigmoid gate over detached token-level signals strengthens distillation on teacher-endorsed tokens and softly downweights the rest. The paper is available on arXiv under ID 2605.15155.
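To make the gated objective concrete, here is a minimal PyTorch-style sketch under stated assumptions. The summary does not specify the exact loss, so the policy-gradient form of the RL term, the forward-KL distillation term, the teacher-minus-student log-probability as the endorsement signal, and the hyperparameters `beta` (gate sharpness) and `lam` (auxiliary weight) are all illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def sdar_loss(student_logits, teacher_logits, actions, advantages,
              old_logprobs, beta=1.0, lam=0.5):
    """Sketch of an RL loss with a gated self-distillation auxiliary term.

    student_logits: (B, T, V) logits from the policy being trained.
    teacher_logits: (B, T, V) logits from a teacher branch that saw
                    privileged context (treated as a fixed target).
    actions:        (B, T) sampled token ids.
    advantages:     (B, T) coarse trajectory-level advantage, broadcast
                    to every token of the trajectory.
    old_logprobs:   (B, T) behavior-policy log-probs of `actions`.
    """
    logp = F.log_softmax(student_logits, dim=-1)
    act_logp = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)  # (B, T)

    # Primary objective: a plain importance-weighted policy-gradient loss
    # on the trajectory-level reward (PPO-style clipping omitted).
    ratio = torch.exp(act_logp - old_logprobs)
    rl_loss = -(ratio * advantages).mean()

    # Token-level endorsement signal: how much more the teacher likes the
    # sampled token than the student does. Detached, so the gate only
    # modulates the distillation term and carries no gradient itself.
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    teacher_act_logp = teacher_logp.gather(
        -1, actions.unsqueeze(-1)).squeeze(-1)
    endorsement = (teacher_act_logp - act_logp).detach()  # (B, T)

    # Sigmoid gate: near 1 on teacher-endorsed tokens (strong distillation),
    # near 0 on teacher-rejected tokens (distillation softly suppressed).
    gate = torch.sigmoid(beta * endorsement)

    # Auxiliary distillation term: per-token forward KL from the detached
    # teacher distribution to the student, weighted by the gate.
    kl = F.kl_div(logp, teacher_logp.detach(), reduction='none',
                  log_target=True).sum(-1)                # (B, T)
    distill_loss = (gate * kl).mean()

    return rl_loss + lam * distill_loss
```

Detaching the endorsement signal means the gate only scales per-token distillation strength; gradients flow through the RL and KL terms alone, which keeps RL the primary driver of the update, as the method intends.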
Key facts
- SDAR stands for Self-Distilled Agentic Reinforcement Learning.
- It is designed for post-training LLM agents.
- RL provides only coarse trajectory-level rewards.
- OPSD adds dense token-level guidance from a teacher branch.
- OPSD suffers from multi-turn instability and harmful negative pressure on teacher-rejected tokens.
- SDAR uses a gated auxiliary objective with RL as primary.
- A sigmoid gate strengthens distillation on teacher-endorsed tokens.
- Paper available on arXiv: 2605.15155.