ARTFEED — Contemporary Art Intelligence

Mechanistic Interpretability Papers Lack Causal Identification Assumptions

other · 2026-05-11

A recent study posted on arXiv (2605.08012) finds that mechanistic interpretability research in AI increasingly uses causal vocabulary, including circuits, mediators, causal abstraction, and monosemanticity, without stating the identification assumptions needed to support causal claims. The authors conducted a targeted review of 10 papers spanning four methodological strands and found no dedicated identification-assumptions sections. Instead, validation metrics such as faithfulness, completeness, monosemanticity, alignment, and ablation effects are offered as causal evidence with the underlying assumptions left implicit. A secondary audit by two human coders on n=30 reproduced the primary findings: missing identification sections and validation metrics standing in for explicit assumptions. The authors propose a disclosure norm: state whether a claim is causal, name the identification strategy, list the assumptions, single out at least one, and explain how the conclusions would change if those assumptions fail.
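
For context, here is a minimal sketch (not from the paper) of what an ablation-effect measurement typically looks like in mechanistic interpretability work. The toy model, weights, and metric are illustrative assumptions; the point is that the resulting number is a validation metric, and treating it as causal evidence requires further assumptions of the kind the study says go unstated.

```python
# Minimal sketch (illustrative, not from the paper): an "ablation effect" metric.
# The toy model and weights below are assumptions made for demonstration only.
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network: input (4) -> hidden (8) -> output (3)
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 3))

def forward(x, ablate_unit=None):
    """Run the toy model, optionally zeroing one hidden unit (an ablation)."""
    h = np.maximum(x @ W1, 0.0)      # ReLU hidden activations
    if ablate_unit is not None:
        h = h.copy()
        h[:, ablate_unit] = 0.0      # intervention: set the chosen unit to zero
    return h @ W2

x = rng.normal(size=(16, 4))         # a small batch of inputs
baseline = forward(x)

# "Ablation effect" per hidden unit: mean change in output when the unit is zeroed.
# A large effect is often read as causal importance, but without stated identification
# assumptions (e.g., no compensating units, zero is a meaningful counterfactual value)
# it remains a validation metric rather than an identified causal claim.
effects = [np.abs(forward(x, ablate_unit=u) - baseline).mean() for u in range(8)]
for u, e in enumerate(effects):
    print(f"hidden unit {u}: mean |output change| = {e:.3f}")
```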

Key facts

  • Paper on arXiv with ID 2605.08012
  • Audit of 10 papers across four methodological strands
  • No dedicated identification-assumptions section found
  • Validation metrics presented as causal support without stated assumptions
  • Two-human-coder audit on n=30 reproduced findings
  • Proposes disclosure norm for causal claims
  • Causal vocabulary includes circuits, mediators, causal abstraction, monosemanticity
  • Metrics include faithfulness, completeness, monosemanticity, alignment, ablation effects

Entities

Institutions

  • arXiv
