Feedback Distillation Improves Lean4 Theorem Proving

ai-technology · 2026-06-01

A training approach called Feedback Distillation improves reasoning models used for Lean4 theorem proving. The model shifts its output distribution according to privileged feedback from a language model, which provides token-level supervision and lets external knowledge be incorporated. Compared with GRPO, Feedback Distillation preserves more trajectory diversity, keeps policy entropy higher, and scales better in pass@k. The two techniques are complementary: initializing GRPO from a Feedback Distillation checkpoint yields better results than either method alone.
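To make "token-level supervision" concrete, the following is a minimal sketch, not the paper's objective. It assumes the supervision takes the form of a per-token KL divergence between a feedback-conditioned teacher distribution and the student policy's distribution, averaged over a trajectory. The function names and the choice of KL direction are illustrative assumptions; the summary above does not specify the loss.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at a single token position.

    Assumption: the teacher is the distribution obtained with access to
    privileged feedback; the student is the policy being trained.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_seq, student_seq):
    """Mean per-token KL over one trajectory (hypothetical objective)."""
    per_token = [token_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(per_token) / len(per_token)

# Toy trajectory: 2 token positions, vocabulary of 3 tokens.
teacher = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
student = [[1.0, 1.0, 1.0], [0.2, 1.5, 0.3]]
loss = distillation_loss(teacher, student)
```

Every token position contributes its own term to the loss, in contrast to a single scalar reward assigned to the whole trajectory.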

Key facts

Feedback Distillation is proposed for post-training reasoning models.
It uses token-level supervision from a language model's privileged feedback.
The method is evaluated on Lean4 theorem proving.
It maintains greater diversity in generated trajectories than GRPO.
Feedback Distillation yields higher policy entropy and better pass@k scaling.
Initializing GRPO from a Feedback Distillation checkpoint outperforms either method alone.
The approach builds on recent work on self-distillation.
The paper is available on arXiv under identifier 2605.30861.
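
The two metrics named above, pass@k and policy entropy, can be sketched as follows. The pass@k function uses the widely used unbiased estimator 1 - C(n-c, k) / C(n, k) for n sampled proofs of which c are correct; whether the paper uses exactly this estimator is an assumption. The entropy function is plain Shannon entropy over one next-token distribution. The sample counts are made up for illustration.

```python
import math

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: probability that at least one of k
    samples, drawn without replacement from n attempts with c successes,
    is correct."""
    if k > n:
        raise ValueError("k must not exceed n")
    # math.comb returns 0 when k > n - c, so the estimate becomes 1.0.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Illustrative numbers only: 10 proof attempts, 3 verified by Lean.
p1 = pass_at_k(10, 3, 1)
p5 = pass_at_k(10, 3, 5)

# A peaked distribution has lower entropy than a uniform one.
peaked = entropy([0.97, 0.01, 0.01, 0.01])
uniform = entropy([0.25, 0.25, 0.25, 0.25])
```

A policy that keeps higher entropy tends to produce more distinct proof attempts, which is one plausible reason its pass@k keeps improving as k grows.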
