ProofGrid Benchmark Tests LLM Reasoning with Machine-Checkable Proofs
Researchers have introduced ProofGrid, a benchmark suite that evaluates the reasoning of large language models through machine-checkable proofs rather than final answers alone. Described in arXiv paper 2605.12524, ProofGrid comprises 15 tasks spanning proof writing, proof checking, proof masking, and gap-filling. Tasks are expressed in minimal formal notation, chiefly NDL, a compact natural-deduction language that fits within short prompts, so each response can be verified mechanically; evaluation is therefore fine-grained, reproducible, and free of human or LLM judging. The suite covers a calibrated difficulty spectrum, from foundational tests to challenge tasks that no current model solves, while minimizing reliance on domain knowledge and long context artifacts. The authors also propose a comparative framework for reasoning benchmarks, situating ProofGrid against existing approaches along axes of representation and verification.
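The paper's concrete NDL syntax is not reproduced in this summary, but the core idea of a machine-checkable proof can be sketched with a toy checker: each proof line names a formula, the inference rule used, and the earlier lines it cites, and the checker accepts only if every step is justified. The tuple encoding and rule labels below ("premise", "->E" for implication elimination) are illustrative assumptions, not ProofGrid's notation.

```python
def check_proof(lines):
    """Mechanically verify a proof given as (formula, rule, cited_indices) lines.

    Implications are encoded as tuples ("->", A, B). The check is purely
    syntactic, so the verdict is reproducible with no human or LLM judge.
    """
    derived = []
    for formula, rule, cites in lines:
        if any(i < 0 or i >= len(derived) for i in cites):
            return False  # a step may only cite strictly earlier lines
        if rule == "premise":
            ok = not cites  # premises justify themselves and cite nothing
        elif rule == "->E" and len(cites) == 2:
            # modus ponens: from A and A -> B, conclude B
            antecedent, implication = derived[cites[0]], derived[cites[1]]
            ok = implication == ("->", antecedent, formula)
        else:
            ok = False  # unknown rule or arity: reject rather than guess
        if not ok:
            return False
        derived.append(formula)
    return True

# A three-line proof of q from premises p and p -> q.
proof = [
    ("p", "premise", []),
    (("->", "p", "q"), "premise", []),
    ("q", "->E", [0, 1]),
]
assert check_proof(proof)
```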
Key facts
- ProofGrid is a benchmark suite for evaluating LLM reasoning through machine-checkable proofs.
- It contains 15 tasks spanning proof writing, checking, masking, and gap-filling (a gap-filling sketch follows this list).
- Tasks are expressed in minimal formal notation, especially NDL.
- NDL is a compact natural-deduction language that fits in short prompts.
- The evaluation is mechanical, reproducible, and fine-grained.
- ProofGrid covers a calibrated difficulty spectrum from foundational to challenge tasks.
- No current model solves the challenge tasks.
- The authors developed a comparative framework for reasoning benchmarks.
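As a companion sketch of the masking and gap-filling task shape, a proof can be published with one formula hidden, and a candidate fill counts as correct only if the completed proof passes the mechanical check. This reuses `check_proof` from the sketch above; the `None` mask marker and the task encoding are assumptions for illustration, not ProofGrid's actual format.

```python
# A proof of q with the intermediate formula at line 2 masked out.
masked = [
    ("p", "premise", []),
    (("->", "p", ("->", "p", "q")), "premise", []),
    (None, "->E", [0, 1]),  # masked step: the model must supply the formula
    ("q", "->E", [0, 2]),
]

def fill_gap(proof, index, candidate):
    """Return a copy of the proof with the masked formula replaced."""
    _, rule, cites = proof[index]
    completed = list(proof)
    completed[index] = (candidate, rule, cites)
    return completed

# Only the fill p -> q makes both dependent steps check out.
assert check_proof(fill_gap(masked, 2, ("->", "p", "q")))  # accepted
assert not check_proof(fill_gap(masked, 2, "q"))           # rejected
```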