SD-Search: Self-Distillation for Search-Augmented Reasoning
SD-Search is a method that improves search-augmented reasoning agents by providing step-level supervision without an external teacher. It uses on-policy hindsight self-distillation: a single model acts as both student and teacher, and the two roles differ only in what they are conditioned on. This targets the credit assignment problem in outcome-reward reinforcement learning, where the reward is attached to the final outcome, so individual search queries receive no step-specific signal. SD-Search requires no additional annotations and no larger model.
Key facts
- SD-Search derives step-level supervision from the policy itself through on-policy hindsight self-distillation
- It requires neither an external teacher nor additional annotations
- A single model plays two roles: student and teacher
- The student sees only the context available at inference time
- The teacher is the same model conditioned on additional hindsight information that the student does not see
- Addresses the credit assignment problem in search-augmented reasoning
- Improves performance of search-augmented reasoning agents
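The student/teacher setup described above can be illustrated with a toy sketch. The summary does not state the training objective, so everything below is an assumption made for illustration: the per-step loss is taken to be a KL divergence from the teacher's distribution to the student's, `toy_policy` is a hypothetical stand-in for the shared model, and the hindsight is a plain string appended to the teacher's context.

```python
import math

def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def toy_policy(context, vocab_size=4):
    """Hypothetical stand-in for the single shared model.

    Maps a context string to deterministic pseudo-logits. A real policy
    would be a language model; this exists only to make the sketch runnable.
    """
    seed = sum(map(ord, context))
    return [((seed * (i + 3)) % 7) / 2.0 for i in range(vocab_size)]

def step_distillation_losses(steps, hindsight):
    """Return one loss per step of a trajectory.

    Student and teacher are the same function; they differ only in
    conditioning. The student sees the inference-time context, the teacher
    sees that context plus hindsight (assumed here to be a string).
    """
    losses = []
    context = ""
    for step in steps:
        student = softmax(toy_policy(context))
        # Teacher distribution is treated as a fixed target (assumption).
        teacher = softmax(toy_policy(context + hindsight))
        losses.append(kl_divergence(teacher, student))
        context += step  # the next step is conditioned on the steps so far
    return losses

if __name__ == "__main__":
    trajectory = ["<search>query one</search>", "<search>query two</search>", "<answer>x</answer>"]
    for i, loss in enumerate(step_distillation_losses(trajectory, " [hindsight: final outcome]")):
        print(f"step {i}: loss = {loss:.4f}")
```

The point of the sketch is that every step gets its own loss value, which is what a single outcome reward cannot provide. When the hindsight string is empty, teacher and student are conditioned identically and every per-step loss is zero.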