LGMT: Logic-Grounded Metamorphic Testing for LLM Reasoning Reliability
LGMT (Logic-Grounded Metamorphic Testing) is a framework for assessing the reasoning reliability of Large Language Models (LLMs). Described in an arXiv preprint (2605.23965), it uses first-order logic (FOL) to derive metamorphic relations from formal logical equivalences, which yields test cases that are semantically invariant by construction. Unlike conventional static benchmarks, LGMT detects reasoning flaws by checking whether a model answers consistently across these equivalent cases, so it needs no ground-truth labels. Experiments on six state-of-the-art LLMs uncovered significant hidden defects that reference-based evaluation had missed. The models proved especially sensitive to symbol-level and conclusion-level variations, and advanced prompting strategies such as few-shot chain-of-thought (CoT) only partially mitigated the problem.
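As a rough illustration of how a logical equivalence can produce a semantically invariant variant, the Python sketch below rewrites a formula by contraposition and by De Morgan's law, then confirms with a truth table that the meaning is preserved. It handles propositional formulas only, which is a simplification of the first-order setting described above. The tuple encoding and the function names (`contrapose`, `de_morgan`, `equivalent`) are illustrative assumptions, not LGMT's actual implementation.

```python
from itertools import product

# Formulas are nested tuples: ("var", "p"), ("not", f), ("and", f, g),
# ("or", f, g), ("implies", f, g). This propositional encoding is an
# assumed simplification for illustration, not the paper's FOL format.

def var(name):
    return ("var", name)

def evaluate(f, env):
    """Evaluate formula f under a truth assignment env (name -> bool)."""
    op = f[0]
    if op == "var":
        return env[f[1]]
    if op == "not":
        return not evaluate(f[1], env)
    if op == "and":
        return evaluate(f[1], env) and evaluate(f[2], env)
    if op == "or":
        return evaluate(f[1], env) or evaluate(f[2], env)
    if op == "implies":
        return (not evaluate(f[1], env)) or evaluate(f[2], env)
    raise ValueError(f"unknown operator: {op}")

def variables(f):
    """Collect the variable names that occur in f."""
    if f[0] == "var":
        return {f[1]}
    return set().union(*(variables(sub) for sub in f[1:]))

def equivalent(f, g):
    """Truth-table check: f and g agree under every assignment."""
    names = sorted(variables(f) | variables(g))
    for values in product([False, True], repeat=len(names)):
        env = dict(zip(names, values))
        if evaluate(f, env) != evaluate(g, env):
            return False
    return True

def contrapose(f):
    """(A -> B) becomes (not B -> not A)."""
    assert f[0] == "implies"
    return ("implies", ("not", f[2]), ("not", f[1]))

def de_morgan(f):
    """not (A and B) becomes (not A) or (not B)."""
    assert f[0] == "not" and f[1][0] == "and"
    return ("or", ("not", f[1][1]), ("not", f[1][2]))

rule = ("implies", var("p"), var("q"))
variant = contrapose(rule)          # same meaning, different surface form
converse = ("implies", var("q"), var("p"))  # NOT an equivalence
```

Here `equivalent(rule, variant)` holds while `equivalent(rule, converse)` does not, which is the distinction a metamorphic relation relies on: only transformations that provably preserve meaning may be used to build follow-up test cases.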
Key facts
- LGMT stands for Logic-Grounded Metamorphic Testing.
- It is an oracle-free framework for evaluating LLM reasoning.
- LGMT leverages first-order logic (FOL) to derive metamorphic relations.
- It constructs semantically invariant test cases from logical equivalences.
- Defects are detected through cross-case consistency checking.
- Experiments were conducted on six state-of-the-art LLMs.
- LGMT exposed hidden defects missed by traditional evaluations.
- Models are sensitive to symbol-level and conclusion-level variations.
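The cross-case consistency check listed above can be sketched in a few lines of Python. The idea is that answers to logically equivalent prompts should agree, so any disagreement flags a defect without consulting a reference label. Everything named here is hypothetical: `check_consistency`, `inconsistency_rate`, the example prompts, and `brittle_model`, a toy stand-in for an LLM call that is deliberately sensitive to renamed symbols.

```python
def check_consistency(model, variants):
    """Query `model` on each equivalent variant and compare the answers.

    No ground-truth label is used: the group is consistent only if
    every variant receives the same (normalized) answer.
    """
    answers = [model(v).strip().lower() for v in variants]
    return {"answers": answers, "consistent": len(set(answers)) == 1}

def inconsistency_rate(model, groups):
    """Fraction of variant groups on which the model contradicts itself."""
    if not groups:
        return 0.0
    flagged = sum(not check_consistency(model, g)["consistent"] for g in groups)
    return flagged / len(groups)

# Hypothetical prompts that share one underlying inference.
original = "If it rains, the ground is wet. It rains. Is the ground wet?"
contraposed = ("If the ground is not wet, it does not rain. "
               "It rains. Is the ground wet?")
renamed = "If it blicks, the ground is dax. It blicks. Is the ground dax?"

def brittle_model(prompt):
    """Toy stand-in for an LLM: fails when familiar symbols are replaced."""
    return "Unknown" if "blicks" in prompt else "True"

stable = check_consistency(brittle_model, [original, contraposed])
broken = check_consistency(brittle_model, [original, renamed])
rate = inconsistency_rate(
    brittle_model, [[original, contraposed], [original, renamed]]
)
```

With this toy model, the contraposed pair is consistent and the renamed pair is not, so one of the two groups is flagged. In practice `brittle_model` would be replaced by a real model call, and answer normalization would likely need to be more careful than lowercasing.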
Entities
Institutions
- arXiv