ProofGrid Benchmark Tests LLM Reasoning with Machine-Checkable Proofs
Researchers have introduced ProofGrid, a benchmark suite that evaluates the reasoning of large language models through machine-checkable proofs rather than final answers alone. Described in arXiv paper 2605.12524, ProofGrid comprises 15 tasks spanning proof writing, proof checking, proof masking, and gap-filling. Tasks are expressed in minimal formal notation, chiefly NDL, a compact natural-deduction language that fits within short prompts, so each response can be verified mechanically; evaluation is therefore fine-grained, reproducible, and free of human or LLM judging. The suite covers a calibrated difficulty spectrum, from foundational tests to challenge tasks that no current model solves, while minimizing reliance on domain knowledge and long context artifacts. The authors also propose a comparative framework for reasoning benchmarks, situating ProofGrid against existing approaches along axes of representation and verification.
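The paper's concrete NDL syntax is not reproduced in this summary, but the core idea of a machine-checkable proof can be sketched with a toy checker: each proof line names a formula, the inference rule used, and the earlier lines it cites, and the checker accepts only if every step is justified. The tuple encoding and rule labels below ("premise", "->E" for implication elimination) are illustrative assumptions, not ProofGrid's notation.

```python
def check_proof(lines):
    """Mechanically verify a proof given as (formula, rule, cited_indices) lines.

    Implications are encoded as tuples ("->", A, B). The check is purely
    syntactic, so the verdict is reproducible with no human or LLM judge.
    """
    derived = []
    for formula, rule, cites in lines:
        if any(i < 0 or i >= len(derived) for i in cites):
            return False  # a step may only cite strictly earlier lines
        if rule == "premise":
            ok = not cites  # premises justify themselves and cite nothing
        elif rule == "->E" and len(cites) == 2:
            # modus ponens: from A and A -> B, conclude B
            antecedent, implication = derived[cites[0]], derived[cites[1]]
            ok = implication == ("->", antecedent, formula)
        else:
            ok = False  # unknown rule or arity: reject rather than guess
        if not ok:
            return False
        derived.append(formula)
    return True

# A three-line proof of q from premises p and p -> q.
proof = [
    ("p", "premise", []),
    (("->", "p", "q"), "premise", []),
    ("q", "->E", [0, 1]),
]
assert check_proof(proof)
```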
Key facts
- ProofGrid is a benchmark suite for evaluating LLM reasoning through machine-checkable proofs.
- It contains 15 tasks spanning proof writing, checking, masking, and gap-filling (a gap-filling sketch follows this list).
- Tasks are expressed in minimal formal notation, especially NDL.
- NDL is a compact natural-deduction language that fits in short prompts.
- The evaluation is mechanical, reproducible, and fine-grained.
- ProofGrid covers a calibrated difficulty spectrum from foundational to challenge tasks.
- No current model solves the challenge tasks.
- The authors developed a comparative framework for reasoning benchmarks.
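As a companion sketch of the masking and gap-filling task shape, a proof can be published with one formula hidden, and a candidate fill counts as correct only if the completed proof passes the mechanical check. This reuses `check_proof` from the sketch above; the `None` mask marker and the task encoding are assumptions for illustration, not ProofGrid's actual format.

```python
# A proof of q with the intermediate formula at line 2 masked out.
masked = [
    ("p", "premise", []),
    (("->", "p", ("->", "p", "q")), "premise", []),
    (None, "->E", [0, 1]),  # masked step: the model must supply the formula
    ("q", "->E", [0, 2]),
]

def fill_gap(proof, index, candidate):
    """Return a copy of the proof with the masked formula replaced."""
    _, rule, cites = proof[index]
    completed = list(proof)
    completed[index] = (candidate, rule, cites)
    return completed

# Only the fill p -> q makes both dependent steps check out.
assert check_proof(fill_gap(masked, 2, ("->", "p", "q")))  # accepted
assert not check_proof(fill_gap(masked, 2, "q"))           # rejected
```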