Causal interventions in neural networks can create divergent representations
A new study on arXiv investigates whether causal interventions in neural networks produce out-of-distribution (divergent) representations, potentially undermining the faithfulness of mechanistic interpretability explanations. The authors show, both theoretically and empirically, that common intervention techniques often shift internal representations away from the distribution induced by natural inputs. They distinguish two types of divergence: 'harmless' divergences that fall within the behavioral null-space and therefore leave model outputs unchanged, and 'pernicious' divergences that activate pathways the network does not use on natural inputs. The work proposes mitigation strategies for the pernicious cases.
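As a concrete illustration of how an intervention can push activations off-distribution, the sketch below patches a subspace of a hidden layer (an interchange-style intervention) and measures the Mahalanobis distance of the result from the natural activation distribution. The toy model, layer choice, and distance metric are illustrative assumptions, not the paper's actual setup.

```python
import torch

# Toy setup (an assumption, not the paper's models): a small MLP whose
# first hidden layer is the intervention site.
torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 4),
)

def hidden(x):
    # Activation at the intervention site (after the first layer + ReLU).
    return model[1](model[0](x))

# Estimate the natural activation distribution from a reference batch.
with torch.no_grad():
    h_ref = hidden(torch.randn(1024, 16))
mu = h_ref.mean(0)
cov = torch.cov(h_ref.T) + 1e-4 * torch.eye(32)  # regularize for inversion
cov_inv = torch.linalg.inv(cov)

def mahalanobis(h):
    # Distance of activations from the estimated natural distribution.
    d = h - mu
    return ((d @ cov_inv) * d).sum(-1).sqrt()

# Interchange-style patch: splice a subspace of the source input's hidden
# state into the base input's hidden state. The mixed vector need not lie
# near the natural activation distribution even though both parents do.
with torch.no_grad():
    x_base, x_src = torch.randn(1, 16), torch.randn(1, 16)
    h_patched = hidden(x_base).clone()
    h_patched[:, :16] = hidden(x_src)[:, :16]  # patch half the units

    print("natural divergence:", mahalanobis(hidden(x_base)).item())
    print("patched divergence:", mahalanobis(h_patched).item())
```

Even in this toy case, the patched vector mixes coordinates from two different activation vectors, so it can land in a region the network never produces on natural inputs, which is exactly the failure mode the study examines.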
Key facts
- Study appears on arXiv with ID 2511.04638
- Focuses on mechanistic interpretability of neural networks
- Causal interventions can create divergent representations
- Two types of divergence identified: harmless and pernicious (see the null-space sketch after this list)
- Pernicious divergences activate pathways unused on natural inputs
- Mitigation strategies are proposed for pernicious cases
- Theoretical and empirical evidence provided
- Concerns raised about faithfulness of explanations
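To make the harmless/pernicious distinction concrete, the sketch below decomposes a hypothetical intervention-induced perturbation against a linear readout: the component in the readout's null-space cannot change behavior, while the orthogonal component can. The linear readout is a simplifying assumption for illustration; a real network's behavioral null-space is generally nonlinear.

```python
import torch

# Illustrative decomposition (assumes a linear readout W, output = W @ h):
# the null-space component of a perturbation is behaviorally inert; the
# row-space component is the part that can change the model's output.
torch.manual_seed(0)
W = torch.randn(4, 32)       # stand-in readout matrix
delta = torch.randn(32)      # perturbation introduced by an intervention

# Orthonormal basis for the row space of W (complement of its null-space).
row_basis = torch.linalg.svd(W, full_matrices=False).Vh.T  # shape (32, 4)
delta_active = row_basis @ (row_basis.T @ delta)  # can alter behavior
delta_null = delta - delta_active                 # W @ delta_null ≈ 0

print("effect of null component:  ", torch.linalg.norm(W @ delta_null).item())
print("effect of active component:", torch.linalg.norm(W @ delta_active).item())
```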