Chain-of-Thought Reasoning Fails Introspective Faithfulness Test in LLMs

other · 2026-05-28

A recent study posted to arXiv examines whether chain-of-thought (CoT) reasoning in large language models faithfully reflects their internal decision-making when they face conflicting knowledge. The study introduces the term introspective faithfulness and evaluates it across 200 questions, 8 models, and 4 prompt conditions. The findings indicate that CoT reasoning stays nearly unchanged across contradictory decisions: flip pairs retain 96% of the similarity measured for same-answer pairs. Self-reported confidence nonetheless carries a weak but genuine signal. For obscure facts, where entity prominence is not helpful, confidence still predicts decisions (p<0.001) and correlates with item-level knowledge (r=0.134). GPT-4o is the only model with a statistically significant link between reasoning and decisions. Claude Sonnet 4.6 shows the widest confidence spread (SD=1.39) but almost no pooled correlation. The results caution against treating CoT as a faithful account of model behavior.
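
The 96% figure for flip pairs can be read as a ratio: the average similarity between reasoning texts that end in opposite decisions, divided by the average similarity between reasoning texts that end in the same decision. The sketch below illustrates that arithmetic only. The similarity metric used in the paper is not stated in this summary, so token-set Jaccard overlap is an assumed stand-in, and the function names and toy strings are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard overlap (assumed stand-in for the paper's metric)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty texts are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def mean_pair_similarity(pairs: list[tuple[str, str]]) -> float:
    """Average similarity over a non-empty list of (text, text) pairs."""
    sims = [jaccard(a, b) for a, b in pairs]
    return sum(sims) / len(sims)


def retention(flip_pairs: list[tuple[str, str]],
              same_pairs: list[tuple[str, str]]) -> float:
    """Share of same-answer similarity that flip pairs retain."""
    return mean_pair_similarity(flip_pairs) / mean_pair_similarity(same_pairs)


# Toy data (hypothetical): reasoning texts for one question.
same_pairs = [("the capital is paris", "the capital is paris")]
flip_pairs = [("the capital is paris", "the capital is lyon")]

ratio = retention(flip_pairs, same_pairs)
print(f"flip pairs retain {ratio:.0%} of same-answer similarity")
```

A ratio near 1.0, as in the reported 96%, would mean the reasoning text barely changes when the decision flips, which is the pattern the study treats as evidence against faithfulness.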

Key facts

Study introduces introspective faithfulness for chain-of-thought reasoning.
Tested across 200 questions, 8 models, and 4 prompt conditions.
In flip pairs, CoT reasoning retains 96% of the similarity measured for same-answer pairs.
Confidence predicts decisions for obscure facts (p<0.001).
GPT-4o is the only model with reliable reasoning-decision coupling.
Claude Sonnet 4.6 has widest confidence range (SD=1.39) but near-zero pooled correlation.
Paper published on arXiv with ID 2605.27773.
