LLMs Struggle with SAT Reasoning Despite High Accuracy

ai-technology · 2026-05-28

A new study posted on arXiv reports that large language models (LLMs) struggle with Boolean satisfiability (SAT) reasoning even when they score well on conventional metrics. The research systematically evaluates LLMs on 2-SAT and 3-SAT problems, along with Vertex Cover and discrete 3D packing reductions. Standard metrics such as accuracy, precision, recall, and F1 are misleading because models over-predict satisfiable formulas and fail to reproduce the classical easy-hard-easy signature around the 3-SAT threshold. Performance also degrades sharply as the number of variables increases. To address this, the authors introduce a paired-formula protocol built on minimally different satisfiable and unsatisfiable instances, together with a metric called Accurate Differentiation Rate (ADR), which requires a model to distinguish the two members of each pair. The findings point to fundamental limitations in LLM reasoning capabilities.
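To make the gap between per-formula accuracy and a paired metric concrete, here is a minimal Python sketch. It rests on an assumption: the summary does not give the exact ADR formula, so the code treats a pair as differentiated only when the model labels the satisfiable member "SAT" and the unsatisfiable member "UNSAT". The names `PairResult`, `accuracy`, and `adr` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class PairResult:
    """Model answers on one (satisfiable, unsatisfiable) formula pair.

    True means the model answered "SAT". Hypothetical container for illustration.
    """
    pred_on_sat: bool
    pred_on_unsat: bool


def accuracy(results):
    """Per-formula accuracy over both members of every pair."""
    correct = sum(r.pred_on_sat for r in results)
    correct += sum(not r.pred_on_unsat for r in results)
    return correct / (2 * len(results))


def adr(results):
    """ASSUMED definition: share of pairs where both members are labeled correctly."""
    both = sum(r.pred_on_sat and not r.pred_on_unsat for r in results)
    return both / len(results)


# A model that over-predicts "SAT": it says SAT on every satisfiable formula
# and also on 8 of the 10 unsatisfiable ones.
biased = [PairResult(True, True)] * 8 + [PairResult(True, False)] * 2

# A model that always answers "SAT", regardless of the formula.
always_sat = [PairResult(True, True)] * 10

print("biased:     accuracy =", accuracy(biased), " ADR =", adr(biased))
print("always SAT: accuracy =", accuracy(always_sat), " ADR =", adr(always_sat))
```

Under this assumed definition, the always-SAT model reaches 50% accuracy and perfect recall on the satisfiable class while differentiating no pairs at all. On a benchmark dominated by satisfiable formulas, its accuracy would look even better, which is the kind of inflation a paired protocol is meant to expose.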

Key facts

Study evaluates LLMs on 2-SAT and 3-SAT problems
Includes Vertex Cover and discrete 3D packing reductions
Conventional metrics like accuracy and F1 are misleading
Models over-predict satisfiable formulas
Models cannot reproduce the easy-hard-easy signature around the 3-SAT threshold
Performance degrades sharply with more variables
Introduces paired-formula protocol with ADR metric
Published on arXiv with ID 2605.28602
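
For readers unfamiliar with the 3-SAT threshold mentioned above: random 3-SAT formulas shift from mostly satisfiable to mostly unsatisfiable as the clause-to-variable ratio passes roughly 4.26, and classical solvers tend to work hardest near that point, which produces the easy-hard-easy pattern. The sketch below is not the study's benchmark. It is a small illustration that generates random formulas and checks them by brute force, so it only suits tiny instances. All function names and parameter values are illustrative assumptions.

```python
import itertools
import random


def random_3sat(num_vars, num_clauses, rng):
    """Random 3-SAT formula: each clause uses 3 distinct variables with random signs.

    Literals are nonzero ints: v means variable v is true, -v means it is false.
    """
    clauses = []
    for _ in range(num_clauses):
        chosen = rng.sample(range(1, num_vars + 1), 3)
        clauses.append(tuple(v if rng.random() < 0.5 else -v for v in chosen))
    return clauses


def is_satisfiable(num_vars, clauses):
    """Brute-force check over all 2**num_vars assignments (small inputs only)."""
    for bits in itertools.product((False, True), repeat=num_vars):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True
    return False


def sat_fraction(num_vars, ratio, trials, seed=0):
    """Fraction of random formulas at a given clause/variable ratio that are satisfiable."""
    rng = random.Random(seed)
    num_clauses = round(ratio * num_vars)
    hits = sum(
        is_satisfiable(num_vars, random_3sat(num_vars, num_clauses, rng))
        for _ in range(trials)
    )
    return hits / trials


# Illustrative sweep; with so few variables the transition is smeared out.
for ratio in (1.0, 3.0, 4.26, 6.0, 8.0):
    print(f"ratio {ratio:>4}: satisfiable fraction = {sat_fraction(8, ratio, 20):.2f}")
```

A model that leans toward answering "SAT" would look strong on the low-ratio side of such a sweep and weak on the high-ratio side, independent of any actual reasoning about the formula.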
