ARTFEED — Contemporary Art Intelligence

AI Agentic Proving Achieves 98% Success in Program Verification

ai-technology · 2026-05-25

A new study evaluates the agentic proving capabilities of Claude Code on CLEVER, a Lean 4 benchmark for verifiable code generation. Claude generates valid specifications for 98.8% of problems, with 81.3% accepted by CLEVER's isomorphism-based scoring. It certifies implementations against correct ground-truth specifications for 87.5% of problems and reaches a 98.1% success rate on the end-to-end pipeline of program generation and verification with self-consistent premises. Claude also provides high-quality feedback on its own attempts, identifying the causes of failures and bugs in the dataset. The research, posted on arXiv (2605.23772), positions agentic systems as state of the art in automated theorem proving for program verification.

Key facts

  • Claude Code evaluated on CLEVER benchmark for program verification
  • 98.8% of problems received valid specifications
  • 81.3% accepted by CLEVER's isomorphism-based scoring
  • 87.5% certification rate against correct ground-truth specifications
  • 98.1% success rate on end-to-end pipeline with self-consistent premises
  • Claude provides high-quality feedback on its own attempts
  • Research published on arXiv (2605.23772)
  • Agentic systems are state-of-the-art for automated theorem proving
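
To make the metrics above concrete, here is a minimal Lean 4 sketch of the three kinds of artifact involved: a specification, an implementation certified against it, and a proof that two phrasings of a specification are equivalent. The problem and every name in it (spec, spec', double, double_meets_spec, spec_equiv) are hypothetical and are not taken from CLEVER or the paper. The equivalence theorem only illustrates the general idea behind isomorphism-style scoring and is an assumption about its shape, not the benchmark's actual checker.

```lean
-- Hypothetical toy problem: "return twice the input".

-- A specification is a property that any candidate implementation must satisfy.
def spec (impl : Nat → Nat) : Prop :=
  ∀ n, impl n = n + n

-- A candidate implementation.
def double (n : Nat) : Nat := 2 * n

-- Certification: a machine-checked proof that the implementation meets the spec.
theorem double_meets_spec : spec double := by
  unfold spec double
  intro n
  omega

-- A second, differently phrased specification of the same behaviour.
def spec' (impl : Nat → Nat) : Prop :=
  ∀ n, impl n = 2 * n

-- Equivalence of the two specifications, proved for every implementation.
theorem spec_equiv (impl : Nat → Nat) : spec impl ↔ spec' impl := by
  unfold spec spec'
  constructor
  · intro h n
    have hn := h n
    omega
  · intro h n
    have hn := h n
    omega
```

In this sketch a specification counts as valid if it type-checks as a Prop, while acceptance under an equivalence check additionally requires a proof such as spec_equiv. That gap is one plausible reason the two reported rates (98.8% and 81.3%) can differ.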

Entities

Institutions

  • arXiv
  • CLEVER
