Hindsight Hint Distillation Boosts SWE Agents Without Chain-of-Thought Data
Researchers propose Hindsight Hint Distillation (HHD), a method that improves software engineering (SWE) agents' planning and reasoning without costly chain-of-thought (CoT) annotations. Using only easy-to-obtain question-answer pairs, HHD synthesizes hindsight hints from the model's own failed self-rollouts and uses them to scaffold on-policy rollouts that successfully complete tasks. The model then self-distills on these scaffolded trajectories and generalizes to new problems without hints. On SWE-bench Verified, HHD achieves an 8% absolute improvement, significantly outperforming iterative RFT and trajectory-synthesis baselines, which improve by only about 2%.
Key facts
- HHD requires only question-answer pairs, not CoT annotations.
- Hindsight hints are synthesized from the model's own failed self-rollouts.
- The method scaffolds on-policy rollouts that successfully complete tasks.
- Model self-distills scaffolded trajectories and generalizes without hints.
- HHD achieves 8% absolute improvement on SWE-bench Verified.
- Baselines (iterative RFT, trajectory-synthesis) improve by only ~2%.
- The paper is published on arXiv with ID 2605.11556.
- HHD is inspired by how human teachers use student mistakes for guidance.
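The loop the facts above describe can be sketched in code. This is a minimal, hypothetical illustration of the control flow only (fail, synthesize a hindsight hint from the failure and the known answer, retry with the hint as scaffolding, keep successful trajectories for self-distillation); all function names and the stub "model" are placeholders, not the paper's implementation.

```python
def rollout(model, task, hint=None):
    """Run one agent attempt on a task; returns (trajectory, success).
    In the real method this would be an LLM agent rollout."""
    success = model(task, hint)
    trajectory = {"task": task["question"], "hint": hint, "success": success}
    return trajectory, success

def synthesize_hint(failed_trajectory, answer):
    """Derive a hindsight hint from the model's own failed attempt plus the
    known answer -- the only supervision HHD assumes (question-answer pairs,
    no CoT annotations). The hint text here is a toy placeholder."""
    return f"A previous attempt failed; the target outcome is: {answer}"

def hhd_collect(model, qa_pairs):
    """Collect successful on-policy trajectories for self-distillation."""
    distill_set = []
    for task in qa_pairs:
        traj, ok = rollout(model, task)            # unassisted attempt
        if not ok:
            hint = synthesize_hint(traj, task["answer"])
            traj, ok = rollout(model, task, hint)  # scaffolded retry
        if ok:
            distill_set.append(traj)               # keep for distillation
    return distill_set

# Toy stub model: fails without a hint, succeeds with one,
# so the scaffolded retry path is exercised.
toy_model = lambda task, hint: hint is not None

data = [{"question": "fix bug in foo()", "answer": "patch foo"}]
print(len(hhd_collect(toy_model, data)))  # → 1
```

After collection, the model would be fine-tuned on `distill_set` (with hints stripped or retained per the paper's recipe, which this sketch does not specify), so that it can solve new problems without hints at inference time.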
Entities
Institutions
- arXiv