SOD: Step-wise On-policy Distillation for Small Language Model Agents
SOD (Step-wise On-policy Distillation) addresses a key shortcoming of on-policy distillation (OPD) for tool-integrated reasoning (TIR) in small language models. OPD provides dense token-level guidance from a teacher on trajectories the student itself produces, but in TIR a single incorrect tool call can cascade through subsequent steps, widening the divergence between student and teacher until the teacher's supervision becomes ineffective. SOD mitigates this by adaptively adjusting the distillation strength at each step according to the measured step-level divergence, thereby curbing the spread of errors. The method targets small language model agents and aims to improve stability over long-horizon tool interactions. The paper is available on arXiv under ID 2605.07725.
Key facts
- SOD stands for Step-wise On-policy Distillation.
- It targets small language model agents.
- Tool-integrated reasoning (TIR) is difficult to scale to small models due to instability and limited capacity.
- On-policy distillation (OPD) provides dense token-level supervision from a teacher.
- In TIR, erroneous tool calls cascade across steps, increasing student-teacher divergence and weakening OPD's supervision.
- SOD adaptively reweights distillation strength per step based on step-level divergence.
- The paper is on arXiv with ID 2605.07725.
- The method aims to improve stability in long-horizon tool interactions.
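The core idea in the facts above, reweighting distillation per step based on step-level divergence, can be sketched in a few lines. This is a minimal illustration under assumptions: it uses per-step KL divergence between teacher and student next-token distributions and an exponential decay weight `exp(-d / tau)`; the paper's exact divergence measure and weighting schedule may differ.

```python
import math

def softmax(logits):
    """Convert a logit vector into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def stepwise_weights(teacher_steps, student_steps, tau=1.0):
    """Per-step distillation weights that shrink as divergence grows.

    teacher_steps / student_steps: lists of logit vectors, one per step
    of the student's trajectory. The exp(-d / tau) schedule is an
    assumed form, not the paper's exact formula.
    """
    divergences = []
    for t_logits, s_logits in zip(teacher_steps, student_steps):
        p = softmax(t_logits)
        q = softmax(s_logits)
        divergences.append(kl_divergence(p, q))
    weights = [math.exp(-d / tau) for d in divergences]
    return weights, divergences

# Toy example: the student matches the teacher at step 0 but has
# diverged (e.g. after a bad tool call) at step 1.
teacher_steps = [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
student_steps = [[2.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
weights, divs = stepwise_weights(teacher_steps, student_steps)
# The aligned step keeps full distillation strength (weight 1.0),
# while the diverged step is down-weighted to limit error propagation.
```

A diverged step still receives some supervision, just less of it, so the teacher's signal is softened rather than discarded where the trajectories no longer match.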
Entities
Institutions
- arXiv