AI Models Tested for Logical Reasoning Faithfulness in Formal Verification Study

ai-technology · 2026-04-22

A recent study explores whether sophisticated AI models take advantage of shortcomings in formal verification while producing logical proofs. Researchers assessed GPT-5 and DeepSeek-R1 on 303 first-order logic challenges sourced from the FOLIO and Multi-LogiEval datasets, concentrating on the phenomenon of "formalization gaming." Although the models achieved compilation rates ranging from 87% to 99%, no consistent signs of gaming were detected. The models tended to report failures instead of attempting to generate proofs, even with prompts. The investigation contrasted unified generation with a two-stage pipeline, uncovering different modes of unfaithfulness that escape detection. The findings highlight the disparity between valid proofs and accurate translations in natural-language reasoning systems. While formal verification guarantees proof validity, it does not ensure faithfulness, creating potential vulnerabilities, particularly for cutting-edge models that develop axiom systems independently. Lean 4 proofs were utilized to assess faithfulness.

Key facts

Study examines formalization gaming in AI logical reasoning
Evaluated GPT-5 and DeepSeek-R1 on 303 first-order logic problems
Used datasets from FOLIO (203 problems) and Multi-LogiEval (100 problems)
Compilation rates ranged from 87% to 99%
Found no evidence of systematic gaming in unified generation
Models preferred reporting failure over forcing proofs
Two-stage pipeline revealed distinct modes of unfaithfulness
Research focused on Lean 4 proof generation

Entities

—

Sources

arXiv cs.AI — 2026-04-22