ARTFEED — Contemporary Art Intelligence

Feedback Distillation Improves Lean4 Theorem Proving

ai-technology · 2026-06-01

A new training approach called Feedback Distillation improves reasoning models used for Lean4 theorem proving. The model aligns its output distribution using privileged feedback from a language model, which supplies token-level supervision and a way to integrate external knowledge. Compared with GRPO (Group Relative Policy Optimization), Feedback Distillation shows greater trajectory diversity, higher policy entropy, and better pass@k scaling. The two techniques are complementary: initializing GRPO from a Feedback Distillation checkpoint yields better results than either method alone.
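To make "token-level supervision" and "policy entropy" concrete, here is a minimal sketch in plain Python. It assumes the supervision signal is a per-token KL divergence between a feedback-conditioned teacher distribution and the student distribution. The summary does not state the paper's actual loss or KL direction, so the function names, the KL(teacher || student) choice, and the toy logits are all illustrative assumptions, not the authors' method.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    logp = log_softmax(logits)
    return -sum(math.exp(lp) * lp for lp in logp)

def token_level_kl(teacher_logits, student_logits):
    """Mean over positions of KL(teacher || student).

    Assumption: the teacher is the same kind of model conditioned on
    privileged feedback; the KL direction is chosen for illustration only.
    """
    per_token = []
    for t_row, s_row in zip(teacher_logits, student_logits):
        t_logp = log_softmax(t_row)
        s_logp = log_softmax(s_row)
        per_token.append(
            sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
        )
    return sum(per_token) / len(per_token)

# Hypothetical logits: 2 token positions, vocabulary of 3 tokens.
teacher = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
student = [[1.0, 1.0, 1.0], [0.2, 1.5, 0.3]]

loss = token_level_kl(teacher, student)
mean_entropy = sum(entropy(row) for row in student) / len(student)
```

Because the loss is computed at every token position, each token receives its own training signal. A reinforcement learning method such as GRPO instead derives its signal from a reward on the whole sampled trajectory.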

Key facts

  • Feedback Distillation is proposed for post-training reasoning models.
  • It uses token-level supervision from a language model's privileged feedback.
  • The method is evaluated on Lean4 theorem proving.
  • It maintains greater diversity in generated trajectories than GRPO.
  • Feedback Distillation yields higher policy entropy and better pass@k scaling.
  • Initializing GRPO from a Feedback Distillation checkpoint outperforms either method alone.
  • The approach builds on recent work on self-distillation.
  • The paper is available on arXiv under identifier 2605.30861.
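For readers unfamiliar with pass@k, the metric is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n proof attempts are sampled per theorem and c of them succeed. The sketch below assumes that standard estimator. The summary does not say which variant the paper uses, and the sample counts are hypothetical.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: the probability that at least one of k
    attempts, drawn without replacement from n sampled attempts of which
    c are correct, succeeds."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical theorem: 64 sampled proofs, 4 accepted by the Lean4 checker.
curve = {k: pass_at_k(64, 4, k) for k in (1, 2, 4, 8, 16, 32)}
```

Better pass@k scaling means this curve keeps rising as k grows. That typically requires the sampled trajectories to stay diverse, which is consistent with the diversity and entropy points above.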

Entities

Institutions

  • arXiv
