Agentic Proving with Claude Code Reaches 98% Success in Program Verification
A new study evaluates the agentic proving capabilities of Claude Code on CLEVER, a Lean 4 benchmark for verifiable code generation. Claude Code generates valid specifications for 98.8% of problems, with 81.3% accepted by CLEVER's isomorphism-based scoring. It certifies implementations against correct ground-truth specifications at a rate of 87.5%, and reaches a 98.1% success rate on the end-to-end generation and verification pipeline when the premises are self-consistent. Claude also gives high-quality feedback on its own attempts, identifying the causes of failures and bugs in the dataset itself. The research was published on arXiv (2605.23772) and reports state-of-the-art performance in automated theorem proving for program verification.
Key facts
- Claude Code evaluated on CLEVER benchmark for program verification
- 98.8% of problems received valid specifications
- 81.3% accepted by CLEVER's isomorphism-based scoring
- 87.5% certification rate against correct ground-truth specifications
- 98.1% success rate on end-to-end pipeline with self-consistent premises
- Claude provides high-quality feedback on its own attempts
- Research published on arXiv (2605.23772)
- The study presents agentic systems as the state of the art for automated theorem proving in program verification
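
To make the two kinds of checks behind these figures concrete, here is a minimal Lean 4 sketch on a toy problem. It is not taken from CLEVER or from the study. The names `groundTruthSpec`, `generatedSpec`, and `impl` are hypothetical, and the plain logical equivalence used here is only an assumed stand-in for CLEVER's isomorphism-based scoring, whose exact form may differ.

```lean
-- Hypothetical toy problem for illustration only; not a CLEVER task.

-- A "ground-truth" specification relating an input to an output.
def groundTruthSpec (input output : Nat) : Prop :=
  output = input + input

-- A differently phrased "generated" specification for the same task.
def generatedSpec (input output : Nat) : Prop :=
  output = 2 * input

-- Specification check (assumed form): both specifications accept
-- exactly the same input/output pairs.
theorem spec_equiv (input output : Nat) :
    generatedSpec input output ↔ groundTruthSpec input output := by
  unfold generatedSpec groundTruthSpec
  constructor <;> intro h <;> omega

-- A candidate implementation.
def impl (n : Nat) : Nat := n * 2

-- Implementation check: the candidate satisfies the ground-truth
-- specification on every input.
theorem impl_correct (n : Nat) : groundTruthSpec n (impl n) := by
  unfold groundTruthSpec impl
  omega
```

In this sketch a specification counts as valid only if Lean accepts a proof like `spec_equiv`, and an implementation counts as certified only if Lean accepts a proof like `impl_correct`. This is why the reported percentages correspond to machine-checked results.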
Entities
Institutions
- arXiv
- CLEVER (benchmark)