AI Models Tested for Logical Reasoning Faithfulness in Formal Verification Study
A recent study explores whether sophisticated AI models take advantage of shortcomings in formal verification while producing logical proofs. Researchers assessed GPT-5 and DeepSeek-R1 on 303 first-order logic challenges sourced from the FOLIO and Multi-LogiEval datasets, concentrating on the phenomenon of "formalization gaming." Although the models achieved compilation rates ranging from 87% to 99%, no consistent signs of gaming were detected. The models tended to report failures instead of attempting to generate proofs, even with prompts. The investigation contrasted unified generation with a two-stage pipeline, uncovering different modes of unfaithfulness that escape detection. The findings highlight the disparity between valid proofs and accurate translations in natural-language reasoning systems. While formal verification guarantees proof validity, it does not ensure faithfulness, creating potential vulnerabilities, particularly for cutting-edge models that develop axiom systems independently. Lean 4 proofs were utilized to assess faithfulness.
Key facts
- Study examines formalization gaming in AI logical reasoning
- Evaluated GPT-5 and DeepSeek-R1 on 303 first-order logic problems
- Used datasets from FOLIO (203 problems) and Multi-LogiEval (100 problems)
- Compilation rates ranged from 87% to 99%
- Found no evidence of systematic gaming in unified generation
- Models preferred reporting failure over forcing proofs
- Two-stage pipeline revealed distinct modes of unfaithfulness
- Research focused on Lean 4 proof generation
Entities
—