Graph Kernels for LLM Mechanistic Interpretability
A new framework reframes mechanistic interpretability of large language models as a graph machine-learning problem. The researchers represent activation-patching profiles as patch-effect graphs over model components and introduce three graph-construction methods: direct influence via causal mediation, partial correlation, and co-influence. Applying graph kernels to GPT-2 Small on Indirect Object Identification tasks shows that patch-effect graphs preserve discriminative structural signals, with localized edge-slot features achieving higher classification accuracy than global graph-kernel methods.
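To make the idea concrete, here is a minimal sketch of the direct-influence construction: nodes are model components (e.g. attention heads and MLPs), and an edge weight records how much patching one component's activation changes another's output. The function name, the effect matrix, and the thresholding are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def build_patch_effect_graph(effects: np.ndarray, threshold: float = 0.0) -> dict:
    """Build a patch-effect graph from a matrix of influence scores.

    effects[i, j] is a hypothetical stand-in for the change in component j's
    output when component i's activation is patched in from a corrupted run
    (a causal-mediation-style direct-influence measurement). Edges below the
    threshold are dropped to keep the graph sparse.
    """
    n = effects.shape[0]
    graph = {"nodes": list(range(n)), "edges": {}}
    for i in range(n):
        for j in range(n):
            if i != j and abs(effects[i, j]) > threshold:
                graph["edges"][(i, j)] = float(effects[i, j])
    return graph
```

In practice the effect matrix would come from running the model on clean/corrupted prompt pairs and measuring a task metric; here it is supplied directly.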
Key facts
- arXiv:2605.06480v1
- Mechanistic interpretability aims to reverse-engineer transformer computations
- Activation patching identifies causal circuits
- Patch-effect graphs represent activation-patching profiles
- Three graph-construction methods introduced
- Evaluated on GPT-2 Small
- Indirect Object Identification (IOI) tasks used
- Localized edge-slot features outperform global methods
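The edge-slot idea in the last point can be sketched as follows. Because every patch-effect graph shares the same fixed node set (the model's components), each (i, j) edge position is a comparable feature dimension across graphs, so graphs can be vectorized slot-by-slot instead of compared with a permutation-invariant global kernel. The helper names and the nearest-centroid classifier below are illustrative stand-ins, not the paper's pipeline.

```python
import numpy as np

def edge_slot_features(adj: np.ndarray) -> np.ndarray:
    """Flatten a patch-effect adjacency matrix into a localized feature
    vector: one dimension per edge slot, aligned across all graphs."""
    return adj.reshape(-1)

def nearest_centroid_predict(train_X, train_y, test_X):
    """Tiny stand-in classifier (an assumption, not the paper's method):
    assign each test graph to the class whose mean feature vector is
    closest in Euclidean distance."""
    classes = sorted(set(train_y))
    centroids = {
        c: train_X[[i for i, y in enumerate(train_y) if y == c]].mean(axis=0)
        for c in classes
    }
    return [min(classes, key=lambda c: np.linalg.norm(x - centroids[c]))
            for x in test_X]
```

Localization is what a global graph kernel gives up: a kernel that only compares graph-level statistics cannot ask "how strong is the edge from head A to head B", while the slot representation keeps that question answerable per feature.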