Causal interventions in neural networks can create divergent representations
A new study on arXiv investigates whether causal interventions in neural networks produce out-of-distribution (divergent) representations, potentially undermining the faithfulness of mechanistic interpretability explanations. The authors show, both theoretically and empirically, that common intervention techniques often shift internal representations away from the distribution induced by natural inputs. They distinguish two types of divergence: 'harmless' divergences that fall within the behavioral null-space and therefore leave model outputs unchanged, and 'pernicious' divergences that activate pathways the network does not use on natural inputs. The work proposes mitigation strategies for the pernicious cases.
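As a concrete illustration of how an intervention can push activations off-distribution, the sketch below patches a subspace of a hidden layer (an interchange-style intervention) and measures the Mahalanobis distance of the result from the natural activation distribution. The toy model, layer choice, and distance metric are illustrative assumptions, not the paper's actual setup.

```python
import torch

# Toy setup (an assumption, not the paper's models): a small MLP whose
# first hidden layer is the intervention site.
torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 4),
)

def hidden(x):
    # Activation at the intervention site (after the first layer + ReLU).
    return model[1](model[0](x))

# Estimate the natural activation distribution from a reference batch.
with torch.no_grad():
    h_ref = hidden(torch.randn(1024, 16))
mu = h_ref.mean(0)
cov = torch.cov(h_ref.T) + 1e-4 * torch.eye(32)  # regularize for inversion
cov_inv = torch.linalg.inv(cov)

def mahalanobis(h):
    # Distance of activations from the estimated natural distribution.
    d = h - mu
    return ((d @ cov_inv) * d).sum(-1).sqrt()

# Interchange-style patch: splice a subspace of the source input's hidden
# state into the base input's hidden state. The mixed vector need not lie
# near the natural activation distribution even though both parents do.
with torch.no_grad():
    x_base, x_src = torch.randn(1, 16), torch.randn(1, 16)
    h_patched = hidden(x_base).clone()
    h_patched[:, :16] = hidden(x_src)[:, :16]  # patch half the units

    print("natural divergence:", mahalanobis(hidden(x_base)).item())
    print("patched divergence:", mahalanobis(h_patched).item())
```

Even in this toy case, the patched vector mixes coordinates from two different activation vectors, so it can land in a region the network never produces on natural inputs, which is exactly the failure mode the study examines.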
Key facts
- Study appears on arXiv with ID 2511.04638
- Focuses on mechanistic interpretability of neural networks
- Causal interventions can create divergent representations
- Two types of divergence identified: harmless and pernicious (see the null-space sketch after this list)
- Pernicious divergences activate pathways unused on natural inputs
- Mitigation strategies are proposed for pernicious cases
- Theoretical and empirical evidence provided
- Concerns raised about faithfulness of explanations
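To make the harmless/pernicious distinction concrete, the sketch below decomposes a hypothetical intervention-induced perturbation against a linear readout: the component in the readout's null-space cannot change behavior, while the orthogonal component can. The linear readout is a simplifying assumption for illustration; a real network's behavioral null-space is generally nonlinear.

```python
import torch

# Illustrative decomposition (assumes a linear readout W, output = W @ h):
# the null-space component of a perturbation is behaviorally inert; the
# row-space component is the part that can change the model's output.
torch.manual_seed(0)
W = torch.randn(4, 32)       # stand-in readout matrix
delta = torch.randn(32)      # perturbation introduced by an intervention

# Orthonormal basis for the row space of W (complement of its null-space).
row_basis = torch.linalg.svd(W, full_matrices=False).Vh.T  # shape (32, 4)
delta_active = row_basis @ (row_basis.T @ delta)  # can alter behavior
delta_null = delta - delta_active                 # W @ delta_null ≈ 0

print("effect of null component:  ", torch.linalg.norm(W @ delta_null).item())
print("effect of active component:", torch.linalg.norm(W @ delta_active).item())
```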