COVCAL: Risk-Controlled Lean-as-Judge for Math Reasoning
A recent arXiv preprint (2605.28365) introduces COVCAL, a method for risk-controlled selection of natural-language mathematical answers that uses the Lean proof assistant as the judge. The paper reports that Lean's signal is coverage-dependent: on MATH-500, the proof-winning answer is correct 96% of the time at high proved coverage but only 20% of the time at low coverage. The signal is also sparse and not always faithful: a 7B autoformalizer produces a proof for only 28% of problems, and only about 43% of those proofs are judged faithful on manual audit. COVCAL either certifies a finite-sample selective-risk bound on the answers it accepts or abstains, using one of two rules: a conservative Bonferroni bound or a tighter dev-then-cal rule. Its practicality depends on autoformalization coverage: with the 7B formalizer, the signal is sparse enough that the Bonferroni rule abstains on all 20 bootstrap partitions.
Key facts
- Lean is used to judge natural-language mathematical answers, but its signal is partial.
- On MATH-500, the proof-winning answer is correct 96% of the time at high proved coverage.
- At low proved coverage, the proof-winning answer is correct only 20% of the time.
- A 7B autoformalizer proves a class for only 28% of problems.
- Only approximately 43% of those proofs are faithful upon manual audit.
- COVCAL is a selector over Lean-trace diagnostics that certifies a selective-risk bound.
- Two regimes: conservative Bonferroni bound and tighter dev-then-cal rule.
- With the 7B formalizer, Bonferroni abstains on all 20 bootstrap partitions.
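To make the Bonferroni regime and its abstention behaviour concrete, here is a minimal Python sketch of a generic selective-risk selector. It is not the paper's implementation. It assumes each calibration item carries a single hypothetical diagnostic score (for example, proved coverage) plus a correctness label, uses an exact binomial (Clopper-Pearson style) upper bound on the error rate among accepted answers, and splits the failure probability delta evenly across candidate thresholds. The function names, the threshold grid, and the choice of bound are all illustrative assumptions.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def risk_upper_bound(errors, n, delta):
    """Exact one-sided upper confidence bound on the error rate.

    Returns (approximately) the p at which P(Bin(n, p) <= errors) = delta,
    found by bisection. With no accepted items, or when every accepted
    item is wrong, the bound is the trivial value 1.0.
    """
    if n == 0 or errors >= n:
        return 1.0
    lo, hi = errors / n, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_cdf(errors, n, mid) > delta:
            lo = mid  # mid is still plausible, so the bound lies above it
        else:
            hi = mid
    return hi


def select_threshold(calibration, thresholds, alpha, delta):
    """Pick the loosest threshold whose selective risk is certified <= alpha.

    calibration: list of (score, is_correct) pairs; `score` is a
                 hypothetical Lean-trace diagnostic, not the paper's feature set.
    thresholds:  candidate acceptance thresholds (accept if score >= t).
    Returns the chosen threshold, or None to signal abstention.
    """
    per_test_delta = delta / len(thresholds)  # Bonferroni correction
    for t in sorted(thresholds):  # smallest threshold accepts the most answers
        accepted = [ok for score, ok in calibration if score >= t]
        errors = sum(1 for ok in accepted if not ok)
        if risk_upper_bound(errors, len(accepted), per_test_delta) <= alpha:
            return t
    return None  # nothing can be certified: abstain


if __name__ == "__main__":
    # Dense signal: 40 high-score items, all correct, plus 10 noisy low-score items.
    dense = [(0.9, True)] * 40 + [(0.2, True)] * 2 + [(0.2, False)] * 8
    # Sparse signal: only 5 high-score items, so no bound can reach alpha.
    sparse = [(0.9, True)] * 5 + [(0.2, True)] * 2 + [(0.2, False)] * 8
    print(select_threshold(dense, [0.1, 0.5], alpha=0.1, delta=0.1))
    print(select_threshold(sparse, [0.1, 0.5], alpha=0.1, delta=0.1))
```

In the sparse example, every accepted answer is correct, yet five samples are too few for the bound to drop below alpha, so the selector abstains. This is the same qualitative effect the summary reports for the 7B formalizer, although the numbers here are synthetic.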