SD-Search: Self-Distillation for Search-Augmented Reasoning

other · 2026-05-20

SD-Search is a training method that gives search-augmented reasoning agents step-level supervision without an external teacher. It uses on-policy hindsight self-distillation: a single model acts as both student and teacher, and the two roles differ only in what they are conditioned on. This targets the credit assignment problem in outcome-reward reinforcement learning, where the reward is attached to the final outcome and individual search queries receive no step-specific signal. SD-Search requires no additional annotations and no larger model.
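
To make the credit assignment problem concrete, here is a minimal Python sketch. It is an illustration only and is not taken from the paper: the `Step` class, the `useful` flag, and `outcome_only_credit` are hypothetical names. With an outcome-only reward, every search query in a trajectory receives the same credit, whether or not it helped.

```python
from dataclasses import dataclass


@dataclass
class Step:
    query: str    # search query issued at this step
    useful: bool  # hypothetical label, NOT visible to an outcome-reward trainer


def outcome_only_credit(steps, final_reward):
    """Broadcast one trajectory-level reward to every step (assumed setup)."""
    return [final_reward for _ in steps]


# Toy trajectory: the second query is irrelevant to the final answer.
trajectory = [
    Step("capital of France", useful=True),
    Step("weather in Paris today", useful=False),
    Step("population of Paris", useful=True),
]

credits = outcome_only_credit(trajectory, final_reward=1.0)

# Every step gets identical credit, so the unhelpful query is
# reinforced exactly as much as the helpful ones.
for step, credit in zip(trajectory, credits):
    print(f"{step.query!r}: credit={credit}")
```

The identical credit values are the gap that step-level supervision is meant to close.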

Key facts

SD-Search derives step-level supervision from the policy itself through on-policy hindsight self-distillation
It requires neither an external teacher nor additional annotations
A single model plays two roles: student and teacher
The student sees only the context available at inference time
The teacher is the same model, conditioned on additional information that the student does not see
It addresses the credit assignment problem in search-augmented reasoning
It improves the performance of search-augmented reasoning agents
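
The student/teacher setup in the list above can be sketched as a per-step distillation loss. This is a sketch under stated assumptions, not the paper's implementation: the summary does not say which loss SD-Search uses or what the teacher's extra information is, so the KL divergence (teacher to student) and the hand-written logits below are illustrative choices. In a real setup both sets of logits would come from the same model, once with the inference-time context and once with the extra hindsight conditioning.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def step_distillation_losses(student_logits, teacher_logits):
    """One loss per step: KL(teacher || student). The loss choice is an assumption."""
    return [
        kl_divergence(softmax(t), softmax(s))
        for s, t in zip(student_logits, teacher_logits)
    ]


# Hypothetical logits over three candidate actions at two search steps.
# Student: conditioned only on the inference-time context.
student = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
# Teacher: same model, conditioned on extra hindsight information (assumed).
teacher = [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0]]

losses = step_distillation_losses(student, teacher)
print(losses)
```

At the first step the two distributions agree, so the loss is zero. At the second step they disagree, so that step alone receives a training signal, which is the kind of step-specific supervision an outcome-only reward cannot provide.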

Entities

—

Sources

arXiv cs.AI — 2026-05-19