Feedback Distillation Improves Lean4 Theorem Proving
Feedback Distillation is a post-training approach for reasoning models, evaluated here on Lean4 theorem proving. The model adjusts its output distribution based on privileged feedback from a language model, which supplies token-level supervision and lets external knowledge be incorporated. Compared with GRPO, Feedback Distillation preserves greater trajectory diversity, maintains higher policy entropy, and scales better in pass@k. The two techniques are complementary: initializing GRPO from a Feedback Distillation checkpoint yields better results than either method alone.
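To make "token-level supervision" concrete, here is a minimal sketch of a per-token distillation loss. The summary does not state the actual objective, so everything here is an assumption: the teacher is taken to be a model that sees the privileged feedback, the student is the policy without it, and the loss is a forward KL divergence averaged over token positions. The function names and the toy logits are hypothetical.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_kl(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) at a single token position."""
    return sum(
        t * (math.log(t + eps) - math.log(s + eps))
        for t, s in zip(teacher_probs, student_probs)
    )

def distillation_loss(teacher_logits, student_logits):
    """Average per-token KL over a trajectory.

    ASSUMPTION: teacher_logits come from a model conditioned on privileged
    feedback; student_logits come from the policy without that feedback.
    Each argument is a list of per-position logit vectors over the same vocabulary.
    """
    per_token = [
        token_kl(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token), per_token

# Toy example: 3 token positions, vocabulary of 4 tokens (made-up numbers).
teacher = [[2.0, 0.1, 0.1, 0.1], [0.1, 2.0, 0.1, 0.1], [0.1, 0.1, 2.0, 0.1]]
student = [[2.0, 0.1, 0.1, 0.1], [1.0, 1.0, 0.1, 0.1], [0.1, 0.1, 0.1, 2.0]]
loss, per_token = distillation_loss(teacher, student)
```

In this toy example the first position contributes zero loss because teacher and student agree, while the later positions contribute a positive signal. That is the sense in which the supervision is per token rather than one scalar reward for the whole trajectory.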
Key facts
- Feedback Distillation is proposed for post-training reasoning models.
- It uses token-level supervision from a language model's privileged feedback.
- The method is evaluated on Lean4 theorem proving.
- It maintains greater diversity in generated trajectories than GRPO.
- Feedback Distillation yields higher policy entropy and better pass@k scaling.
- Initializing GRPO from a Feedback Distillation checkpoint outperforms either method alone.
- The approach builds on recent work on self-distillation.
- The paper is available on arXiv under identifier 2605.30861.
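The two metrics named in the key facts, pass@k and policy entropy, can be computed as in the sketch below. The pass@k function uses the standard unbiased estimator 1 - C(n-c, k) / C(n, k) for n samples of which c are correct. Whether the paper uses exactly this estimator is an assumption. The entropy function is plain Shannon entropy of one next-token distribution, and the helper names are hypothetical.

```python
import math

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n sampled proofs, c of which verify.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    k samples drawn without replacement from the n are all incorrect.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is then exactly 1.0.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_policy_entropy(distributions):
    """Average entropy over the token positions of a trajectory."""
    return sum(entropy(d) for d in distributions) / len(distributions)

# Toy usage with made-up counts: 10 sampled proofs per theorem, 1 correct.
curve = [pass_at_k(10, 1, k) for k in (1, 2, 5, 10)]
```

With one correct sample out of ten, the estimate rises from about 0.1 at k = 1 to 1.0 at k = 10. A policy that keeps its samples diverse tends to solve more distinct theorems as k grows, which is one reading of the "better pass@k scaling" claim.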
Entities
Institutions
- arXiv