Graph Kernels for LLM Mechanistic Interpretability
A new framework reframes mechanistic interpretability of large language models as a graph machine-learning problem. The researchers represent activation-patching profiles as patch-effect graphs over model components and introduce three graph-construction methods: direct influence via causal mediation, partial correlation, and co-influence. Applying graph kernels to GPT-2 Small on Indirect Object Identification tasks shows that patch-effect graphs preserve discriminative structural signals, with localized edge-slot features achieving higher classification accuracy than global graph-kernel methods.
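To make the idea concrete, here is a minimal sketch of the direct-influence construction: nodes are model components (e.g. attention heads and MLPs), and an edge weight records how much patching one component's activation changes another's output. The function name, the effect matrix, and the thresholding are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def build_patch_effect_graph(effects: np.ndarray, threshold: float = 0.0) -> dict:
    """Build a patch-effect graph from a matrix of influence scores.

    effects[i, j] is a hypothetical stand-in for the change in component j's
    output when component i's activation is patched in from a corrupted run
    (a causal-mediation-style direct-influence measurement). Edges below the
    threshold are dropped to keep the graph sparse.
    """
    n = effects.shape[0]
    graph = {"nodes": list(range(n)), "edges": {}}
    for i in range(n):
        for j in range(n):
            if i != j and abs(effects[i, j]) > threshold:
                graph["edges"][(i, j)] = float(effects[i, j])
    return graph
```

In practice the effect matrix would come from running the model on clean/corrupted prompt pairs and measuring a task metric; here it is supplied directly.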
Key facts
- arXiv:2605.06480v1
- Mechanistic interpretability aims to reverse-engineer transformer computations
- Activation patching identifies causal circuits
- Patch-effect graphs represent activation-patching profiles
- Three graph-construction methods introduced
- Evaluated on GPT-2 Small
- Indirect Object Identification (IOI) tasks used
- Localized edge-slot features outperform global methods
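The edge-slot idea in the last point can be sketched as follows. Because every patch-effect graph shares the same fixed node set (the model's components), each (i, j) edge position is a comparable feature dimension across graphs, so graphs can be vectorized slot-by-slot instead of compared with a permutation-invariant global kernel. The helper names and the nearest-centroid classifier below are illustrative stand-ins, not the paper's pipeline.

```python
import numpy as np

def edge_slot_features(adj: np.ndarray) -> np.ndarray:
    """Flatten a patch-effect adjacency matrix into a localized feature
    vector: one dimension per edge slot, aligned across all graphs."""
    return adj.reshape(-1)

def nearest_centroid_predict(train_X, train_y, test_X):
    """Tiny stand-in classifier (an assumption, not the paper's method):
    assign each test graph to the class whose mean feature vector is
    closest in Euclidean distance."""
    classes = sorted(set(train_y))
    centroids = {
        c: train_X[[i for i, y in enumerate(train_y) if y == c]].mean(axis=0)
        for c in classes
    }
    return [min(classes, key=lambda c: np.linalg.norm(x - centroids[c]))
            for x in test_X]
```

Localization is what a global graph kernel gives up: a kernel that only compares graph-level statistics cannot ask "how strong is the edge from head A to head B", while the slot representation keeps that question answerable per feature.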