MathlibPR: Benchmark for LLM-Assisted Review of Formal Math Libraries
Researchers have introduced MathlibPR, a benchmark built from the real pull request histories of Mathlib4, to evaluate whether large language models (LLMs) can assist in reviewing contributions to formal mathematical libraries. The Lean and Mathlib ecosystem has become the de facto standard for LLM-assisted formal reasoning, but its growth is bottlenecked by human review. The benchmark tests both standalone LLMs (DeepSeek, Qwen, Goedel, Kimina) and LLM-based agents (Codex, Claude Code) on judging whether a PR is ready to merge. Initial results show that both kinds of systems struggle to distinguish merge-ready PRs from those that still need changes.
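To make the task concrete, here is a minimal sketch of the binary judgment described above, assuming a generic text-in/text-out judge. The PullRequest fields, prompt wording, label names, and the judge_fn interface are illustrative assumptions, not MathlibPR's actual data format or protocol.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PullRequest:
    title: str
    diff: str   # Lean 4 diff against Mathlib4 (illustrative field)
    label: str  # ground truth from PR history: "merge-ready" or "needs-changes"

def build_prompt(pr: PullRequest) -> str:
    # Hypothetical prompt; the benchmark's real instructions may differ.
    return (
        "You are reviewing a pull request to Mathlib4.\n"
        "Judge whether it follows Mathlib's conventions and is ready to merge.\n"
        f"Title: {pr.title}\n"
        f"Diff:\n{pr.diff}\n"
        "Answer with exactly one label: merge-ready or needs-changes."
    )

def evaluate(prs: List[PullRequest], judge_fn: Callable[[str], str]) -> float:
    """Accuracy of a model or agent (wrapped as judge_fn) on the binary task."""
    correct = sum(judge_fn(build_prompt(pr)).strip() == pr.label for pr in prs)
    return correct / len(prs)

if __name__ == "__main__":
    # Trivial baseline that always predicts "merge-ready"; a real run would wrap
    # a model such as DeepSeek or an agent such as Claude Code behind judge_fn.
    sample = [
        PullRequest("feat: add a lemma on Nat addition", "-- diff omitted", "needs-changes"),
        PullRequest("chore: golf a proof in Order.Basic", "-- diff omitted", "merge-ready"),
    ]
    print(f"accuracy = {evaluate(sample, lambda _: 'merge-ready'):.2f}")
```

Wrapping models and agents behind the same judge_fn interface keeps the comparison uniform, which is the spirit of evaluating both kinds of systems on the same readiness-judgment task.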
Key facts
- MathlibPR is a benchmark built from real Mathlib4 pull request histories
- The Lean and Mathlib ecosystem is the de facto standard for LLM-assisted formal reasoning
- Mathlib's growth is bottlenecked by the human review process
- LLMs are evaluated on their ability to judge whether PRs follow Mathlib's conventions
- Models tested include DeepSeek, Qwen, Goedel, and Kimina
- Agents tested include Codex and Claude Code
- Both standalone LLMs and agents struggle to distinguish merge-ready PRs from those that need changes
- The benchmark proposes a staged evaluation protocol
Entities
Projects
- Mathlib
- Mathlib4
- Lean