ARTFEED — Contemporary Art Intelligence

LLMs Struggle with SAT Reasoning Despite High Accuracy

ai-technology · 2026-05-28

A new study posted on arXiv finds that large language models (LLMs) reason poorly about Boolean satisfiability (SAT) even when they score well on conventional metrics. The research systematically evaluates LLMs on 2-SAT and 3-SAT problems, along with Vertex Cover and discrete 3D packing reductions. Standard metrics such as accuracy, precision, recall, and F1 are misleading because models over-predict satisfiable formulas. The models also fail to reproduce the classical easy-hard-easy signature around the 3-SAT threshold, where instances are hardest near the critical clause-to-variable ratio, and their performance degrades sharply as the number of variables increases. To address this, the authors introduce a paired-formula protocol built on minimally different satisfiable and unsatisfiable instances, together with a new metric, Accurate Differentiation Rate (ADR), which requires models to distinguish between the two members of each pair. The findings point to fundamental limitations in LLM reasoning capabilities.
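The gap between conventional metrics and a pair-based metric can be illustrated with a small Python sketch. The summary does not give the exact ADR formula, so the definition here is an assumption: a pair counts only when the model labels the satisfiable member "satisfiable" and the unsatisfiable member "unsatisfiable". The function names and the sample predictions are hypothetical.

```python
from typing import List, Tuple

# One entry per formula pair: (prediction for the satisfiable member,
# prediction for the unsatisfiable member). True means "satisfiable".
Pair = Tuple[bool, bool]


def accuracy(pairs: List[Pair]) -> float:
    """Fraction of individual formulas labeled correctly."""
    correct = sum(int(p_sat) + int(not p_unsat) for p_sat, p_unsat in pairs)
    return correct / (2 * len(pairs))


def recall_sat(pairs: List[Pair]) -> float:
    """Fraction of truly satisfiable formulas labeled satisfiable."""
    return sum(int(p_sat) for p_sat, _ in pairs) / len(pairs)


def adr(pairs: List[Pair]) -> float:
    """Assumed ADR: fraction of pairs where BOTH members are labeled correctly."""
    return sum(int(p_sat and not p_unsat) for p_sat, p_unsat in pairs) / len(pairs)


# A hypothetical model that answers "satisfiable" for every formula.
always_sat = [(True, True)] * 10

print(accuracy(always_sat))    # chance-level accuracy on balanced pairs
print(recall_sat(always_sat))  # perfect recall on the satisfiable class
print(adr(always_sat))         # no pair is actually differentiated
```

Under this assumed definition, a model that over-predicts "satisfiable" keeps perfect recall on the satisfiable class yet differentiates no pairs at all, which is the kind of blind spot a pair-based metric is meant to expose.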

Key facts

  • Study evaluates LLMs on 2-SAT and 3-SAT problems
  • Includes Vertex Cover and discrete 3D packing reductions
  • Conventional metrics like accuracy and F1 are misleading
  • Models over-predict satisfiable formulas
  • Models cannot reproduce the easy-hard-easy signature around the 3-SAT threshold
  • Performance degrades sharply with more variables
  • Introduces paired-formula protocol with ADR metric
  • Published on arXiv with ID 2605.28602
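One way to picture the paired-formula idea from the key facts is the following Python sketch. It is not the authors' construction, which the summary does not describe. It assumes a simple scheme instead: random 3-literal clauses are added one at a time until the formula first becomes unsatisfiable, so the last satisfiable formula and the first unsatisfiable one differ by exactly one clause. The brute-force checker and all names are hypothetical and only practical for a handful of variables.

```python
import itertools
import random
from typing import List, Tuple

# DIMACS-style literals: 3 means x3, -3 means NOT x3.
Clause = Tuple[int, ...]


def is_satisfiable(clauses: List[Clause], n_vars: int) -> bool:
    """Brute-force check over all 2**n_vars assignments (small n_vars only)."""
    for bits in itertools.product([False, True], repeat=n_vars):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True
    return False


def random_clause(n_vars: int, rng: random.Random, k: int = 3) -> Clause:
    """Pick k distinct variables and negate each with probability 1/2."""
    chosen = rng.sample(range(1, n_vars + 1), k)
    return tuple(v if rng.random() < 0.5 else -v for v in chosen)


def make_pair(n_vars: int, rng: random.Random) -> Tuple[List[Clause], List[Clause]]:
    """Grow a formula until one extra clause makes it unsatisfiable.

    Returns (satisfiable_formula, unsatisfiable_formula); the second is the
    first plus a single appended clause.
    """
    clauses: List[Clause] = []
    while True:
        candidate = clauses + [random_clause(n_vars, rng)]
        if not is_satisfiable(candidate, n_vars):
            return clauses, candidate
        clauses = candidate


rng = random.Random(0)  # fixed seed so the sketch is reproducible
sat_formula, unsat_formula = make_pair(5, rng)
print(len(sat_formula), len(unsat_formula))
```

A model evaluated on such pairs cannot score well by guessing "satisfiable" by default, because the two members are nearly identical on the surface but have opposite answers.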

Entities

Institutions

  • arXiv

Sources