LGMT: Logic-Grounded Metamorphic Testing for LLM Reasoning Reliability

ai-technology · 2026-05-26

Researchers have introduced LGMT (Logic-Grounded Metamorphic Testing), a framework for assessing the reasoning reliability of Large Language Models (LLMs). Described in an arXiv paper (2605.23965), the framework uses first-order logic (FOL) to derive metamorphic relations from formal logical equivalences, producing test cases that are semantically invariant: each variant states the same reasoning problem in a different form. Unlike conventional static benchmarks, LGMT detects reasoning flaws by checking whether a model's answers stay consistent across these equivalent cases, so it needs no ground-truth labels. Experiments on six state-of-the-art LLMs uncovered significant hidden defects that reference-based evaluation had missed. The findings indicate that the models are especially sensitive to variations at the symbol and conclusion levels, and that advanced prompting strategies such as Few-shot CoT only partially mitigate the problem.
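To make the idea of a metamorphic relation grounded in logical equivalence concrete, here is a minimal Python sketch. It is not the paper's implementation: it works at the propositional level as a simplification of FOL, and the function names (`equivalent`, `consistency_defects`) and the example answers are hypothetical. It first confirms by truth table that two rewrites preserve meaning, then flags disagreements between answers given to equivalent cases, without consulting any reference label.

```python
from itertools import product


def implies(p, q):
    """Material implication: p -> q."""
    return (not p) or q


# Three forms of the same statement. The rewrites are classical logical
# equivalences, so a sound reasoner should treat all three identically.
def original(p, q):
    return implies(p, q)


def contrapositive(p, q):
    return implies(not q, not p)


def double_negation(p, q):
    return not (not implies(p, q))


def equivalent(f, g, arity=2):
    """Check equivalence by comparing f and g on every truth assignment."""
    return all(f(*v) == g(*v) for v in product([False, True], repeat=arity))


def consistency_defects(answers):
    """Return pairs of variant names whose answers disagree.

    `answers` maps a variant name to the answer a model gave for it.
    No ground-truth label is needed: any disagreement is itself a defect.
    """
    names = list(answers)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if answers[a] != answers[b]
    ]


# Hypothetical answers from a model under test (illustrative values only).
answers = {
    "original": "True",
    "contrapositive": "True",
    "double_negation": "False",
}
defects = consistency_defects(answers)
```

In this toy run the model is marked defective because its answer to the double-negation variant differs from the other two, even though nobody has said which answer is correct.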

Key facts

LGMT stands for Logic-Grounded Metamorphic Testing.
It is an oracle-free framework for evaluating LLM reasoning.
LGMT leverages first-order logic (FOL) to derive metamorphic relations.
It constructs semantically invariant test cases from logical equivalences.
Defects are detected through cross-case consistency checking.
Experiments were conducted on six state-of-the-art LLMs.
LGMT exposed hidden defects missed by traditional evaluations.
Models are sensitive to symbol-level and conclusion-level variations.
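
A symbol-level variation can be sketched as a consistent renaming of predicate and constant symbols, which leaves the logical structure of a problem untouched. The Python example below is an assumed illustration rather than the paper's transformation: the formula syntax, the `rename_symbols` helper, and the `toy_model` stand-in for an LLM call are all hypothetical.

```python
import re


def rename_symbols(formula, mapping):
    """Consistently rename symbols in a formula string.

    A single regex pass is used so that a renamed symbol is never
    renamed a second time by a later mapping entry.
    """
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], formula)


premises = ["forall x. (Bird(x) -> CanFly(x))", "Bird(tweety)"]
conclusion = "CanFly(tweety)"

# Abstract names carry no world knowledge; the entailment is unchanged.
mapping = {"Bird": "P1", "CanFly": "P2", "tweety": "c1"}
variant_premises = [rename_symbols(p, mapping) for p in premises]
variant_conclusion = rename_symbols(conclusion, mapping)


def toy_model(premises, conclusion):
    """Hypothetical stand-in for an LLM: it leans on familiar words."""
    return "True" if "Bird" in " ".join(premises) else "Unknown"


def is_consistent(model, case_a, case_b):
    """A model passes if it answers two equivalent cases identically."""
    return model(*case_a) == model(*case_b)


passed = is_consistent(
    toy_model,
    (premises, conclusion),
    (variant_premises, variant_conclusion),
)
```

Here `toy_model` fails the check because its answer depends on the surface symbol `Bird` rather than on the logical form, which is the kind of sensitivity the key facts describe.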
