New Framework Detects Unfaithful Chain-of-Thought Reasoning in LLMs
Researchers have introduced CIE-Scorer, a framework for detecting unfaithful chain-of-thought (CoT) reasoning in large language models (LLMs). CoT reasoning improves problem-solving, but a generated trace may not reflect the model's actual decision process. Existing detectors rely on external signals such as textual plausibility or answer consistency and ignore the model's internal computation. Circuit tracing methods provide internal evidence but are costly to apply to long CoTs. CIE-Scorer addresses this with circuit-guided internal-external discrepancy scoring, which is intended to make detection scale. The framework scores each instance by how well its reasoning trace aligns with the model's computational process. The work targets a key challenge in LLM interpretability and reliability.
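To make the idea of an internal-external discrepancy concrete, here is a minimal toy sketch. It is not CIE-Scorer's actual scoring rule, which this summary does not describe. It assumes, purely for illustration, that each reasoning step already has two importance weights: an "external" one read off the written trace and an "internal" one attributed from the model's computation. The function names, the inputs, and the choice of total variation distance are all hypothetical.

```python
def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def discrepancy_score(external, internal):
    """Toy internal-external discrepancy (hypothetical, not the paper's metric).

    external: per-step importance implied by the written reasoning trace.
    internal: per-step importance attributed from the model's computation.
    Returns the total variation distance between the two normalized
    profiles: 0.0 means they agree exactly, 1.0 means they do not overlap.
    """
    if len(external) != len(internal):
        raise ValueError("both profiles must cover the same reasoning steps")
    p = normalize(external)
    q = normalize(internal)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Illustrative values only: the trace emphasizes step 1,
# while the (assumed) internal attribution emphasizes step 3.
score = discrepancy_score([0.5, 0.3, 0.2], [0.1, 0.1, 0.8])
print(f"discrepancy = {score:.2f}")
```

Under these assumptions, a higher score flags an instance whose stated reasoning diverges more from what the model appears to have computed. How the framework actually derives its internal signal from circuits is not covered here.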
Key facts
- CIE-Scorer is a framework for instance-level CoT unfaithfulness detection.
- It uses circuit-guided internal-external discrepancy scoring.
- Existing detectors rely only on external signals such as textual plausibility or answer consistency.
- Circuit tracing methods are costly for long CoTs.
- The framework scores how well each reasoning trace aligns with the model's computational process.
- It addresses scalability challenges in circuit tracing.
- The research is published on arXiv with ID 2605.25603.
- The paper was announced as a new submission.
Entities
Institutions
- arXiv