MathlibPR: Benchmark for LLM-Assisted Review of Formal Math Libraries
Researchers have introduced MathlibPR, a benchmark built from the real pull request histories of Mathlib4, to evaluate whether large language models (LLMs) can assist in reviewing contributions to formal mathematical libraries. The Lean and Mathlib ecosystem has become the de facto standard for LLM-assisted formal reasoning, but its growth is bottlenecked by human review. The benchmark tests both standalone LLMs (DeepSeek, Qwen, Goedel, Kimina) and LLM-based agents (Codex, Claude Code) on judging whether a PR is ready to merge. Initial results show that both kinds of systems struggle to distinguish merge-ready PRs from those that still need changes.
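To make the task concrete, here is a minimal sketch of the binary judgment described above, assuming a generic text-in/text-out judge. The PullRequest fields, prompt wording, label names, and the judge_fn interface are illustrative assumptions, not MathlibPR's actual data format or protocol.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PullRequest:
    title: str
    diff: str   # Lean 4 diff against Mathlib4 (illustrative field)
    label: str  # ground truth from PR history: "merge-ready" or "needs-changes"

def build_prompt(pr: PullRequest) -> str:
    # Hypothetical prompt; the benchmark's real instructions may differ.
    return (
        "You are reviewing a pull request to Mathlib4.\n"
        "Judge whether it follows Mathlib's conventions and is ready to merge.\n"
        f"Title: {pr.title}\n"
        f"Diff:\n{pr.diff}\n"
        "Answer with exactly one label: merge-ready or needs-changes."
    )

def evaluate(prs: List[PullRequest], judge_fn: Callable[[str], str]) -> float:
    """Accuracy of a model or agent (wrapped as judge_fn) on the binary task."""
    correct = sum(judge_fn(build_prompt(pr)).strip() == pr.label for pr in prs)
    return correct / len(prs)

if __name__ == "__main__":
    # Trivial baseline that always predicts "merge-ready"; a real run would wrap
    # a model such as DeepSeek or an agent such as Claude Code behind judge_fn.
    sample = [
        PullRequest("feat: add a lemma on Nat addition", "-- diff omitted", "needs-changes"),
        PullRequest("chore: golf a proof in Order.Basic", "-- diff omitted", "merge-ready"),
    ]
    print(f"accuracy = {evaluate(sample, lambda _: 'merge-ready'):.2f}")
```

Wrapping models and agents behind the same judge_fn interface keeps the comparison uniform, which is the spirit of evaluating both kinds of systems on the same readiness-judgment task.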
Key facts
- MathlibPR is a benchmark built from real Mathlib4 pull request histories
- The Lean and Mathlib ecosystem is the de facto standard for LLM-assisted formal reasoning
- Mathlib's growth is bottlenecked by the human review process
- LLMs are evaluated on their ability to judge whether PRs follow Mathlib's conventions
- Models tested include DeepSeek, Qwen, Goedel, and Kimina
- Agents tested include Codex and Claude Code
- Both standalone LLMs and agents struggle to distinguish merge-ready PRs from those that need changes
- The benchmark proposes a staged evaluation protocol
Entities
Projects
- Mathlib
- Mathlib4
- Lean