AI Agentic Proving Achieves 98% Success in Program Verification

ai-technology · 2026-05-25

A new study evaluates the agentic proving capabilities of Claude Code on CLEVER, a Lean 4 benchmark for verifiable code generation. Claude Code generates valid specifications for 98.8% of problems, certifies implementations against correct ground-truth specifications in 87.5% of cases, and reaches a 98.1% success rate on end-to-end program generation and verification. Claude also gives high-quality feedback on its own attempts, identifying the causes of failures as well as bugs in the dataset. The study, published on arXiv (2605.23772), reports state-of-the-art performance in automated theorem proving for program verification.
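To make the three stages concrete (specification, implementation, certification), here is a minimal Lean 4 sketch on a toy problem. It is not taken from CLEVER or from the study: the names `maxSpec`, `myMax`, and `myMax_meets_spec` are hypothetical, and the proof assumes a recent Lean 4 toolchain in which the `omega` tactic is available without extra imports.

```lean
-- Toy illustration only; not a CLEVER problem.

-- Specification: `r` is the maximum of `a` and `b`.
def maxSpec (a b r : Nat) : Prop :=
  a ≤ r ∧ b ≤ r ∧ (r = a ∨ r = b)

-- Implementation whose correctness we want to certify.
def myMax (a b : Nat) : Nat :=
  if a ≤ b then b else a

-- Certification: a machine-checked proof that the implementation
-- satisfies the specification for every input.
theorem myMax_meets_spec (a b : Nat) : maxSpec a b (myMax a b) := by
  unfold maxSpec myMax
  -- Split on the branch condition, simplify the `if`, and let `omega`
  -- discharge any remaining linear-arithmetic goal.
  by_cases h : a ≤ b <;> simp [h] <;> omega
```

In this framing, an end-to-end run succeeds only if Lean accepts the final theorem, so a result cannot be counted as verified on the strength of plausible-looking output alone.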

Key facts

Claude Code evaluated on CLEVER benchmark for program verification
98.8% of problems received valid specifications
81.3% accepted by CLEVER's isomorphism-based scoring
87.5% certification rate against correct ground-truth specifications
98.1% success rate on end-to-end pipeline with self-consistent premises
Claude provides high-quality feedback on its own attempts
Research published on arXiv (2605.23772)
Results position agentic systems as state-of-the-art for automated theorem proving in program verification
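The isomorphism-based scoring listed above can be pictured as a proof obligation between two specifications. The Lean 4 sketch below assumes that "isomorphism" means a proof that a generated specification is logically equivalent to the ground-truth one; this summary does not spell out CLEVER's exact definition, and `groundTruthSpec`, `generatedSpec`, and `specs_equivalent` are hypothetical names for a toy problem.

```lean
-- Toy illustration only; the equivalence check is an assumption about
-- how isomorphism-based scoring works, not CLEVER's actual harness.

-- Hypothetical ground-truth specification: `r` is the maximum of `a` and `b`.
def groundTruthSpec (a b r : Nat) : Prop :=
  a ≤ r ∧ b ≤ r ∧ (r = a ∨ r = b)

-- Hypothetical generated specification, phrased differently.
def generatedSpec (a b r : Nat) : Prop :=
  (a ≤ b → r = b) ∧ (b < a → r = a)

-- The two specifications accept exactly the same results.
theorem specs_equivalent (a b r : Nat) :
    generatedSpec a b r ↔ groundTruthSpec a b r := by
  unfold generatedSpec groundTruthSpec
  omega
```

Under this reading, a specification can be valid yet fail the scoring step if no such equivalence proof is found, which is one way the 98.8% and 81.3% figures could differ.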
