ARTFEED — Contemporary Art Intelligence

Graph Kernels for LLM Mechanistic Interpretability

ai-technology · 2026-05-09

A new framework reframes mechanistic interpretability of large language models as a graph machine-learning problem. Researchers propose representing activation-patching profiles as patch-effect graphs over model components, introducing three graph-construction methods: direct-influence via causal mediation, partial-correlation, and co-influence. Applying graph kernels to GPT-2 Small on Indirect Object Identification tasks shows that patch-effect graphs preserve discriminative structural signals, with localized edge-slot features achieving higher classification accuracy than global methods.
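A minimal sketch of what a patch-effect graph could look like in code: nodes are model components (attention heads, MLPs) and a weighted edge records the measured effect of patching one component's activation. The component names, random "effects", and threshold below are illustrative placeholders, not the paper's actual procedure.

```python
import numpy as np

# Hypothetical component set; a real graph would cover all heads/MLPs.
components = ["head_0_1", "head_9_6", "mlp_10", "head_11_2"]
n = len(components)

# Stand-in for measured patching effects; real values would come from
# causal-mediation (activation-patching) experiments on the model.
rng = np.random.default_rng(0)
effects = rng.normal(size=(n, n))
np.fill_diagonal(effects, 0.0)

# Direct-influence-style construction (illustrative): keep edges whose
# absolute patch effect exceeds a threshold, giving a sparse weighted graph.
threshold = 0.5
adj = np.where(np.abs(effects) > threshold, effects, 0.0)
edges = [(components[i], components[j], adj[i, j])
         for i in range(n) for j in range(n) if adj[i, j] != 0.0]
print(f"{len(edges)} edges retained")
```

The resulting adjacency matrix is the object graph kernels then operate on.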

Key facts

  • arXiv:2605.06480v1
  • Mechanistic interpretability aims to reverse-engineer transformer computations
  • Activation patching identifies causal circuits
  • Patch-effect graphs represent activation-patching profiles
  • Three graph-construction methods introduced
  • Evaluated on GPT-2 Small
  • Indirect Object Identification (IOI) tasks used
  • Localized edge-slot features outperform global methods
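The "localized edge-slot features" idea can be sketched as follows: with a fixed component set, every ordered pair of components gets its own feature slot, so each patch-effect graph maps to one aligned fixed-length vector, and a kernel between graphs compares each edge directly with its counterpart rather than with a global graph summary. All names and values here are illustrative, not the paper's implementation.

```python
import numpy as np

def edge_slot_features(adj: np.ndarray) -> np.ndarray:
    """Flatten a patch-effect adjacency matrix into an edge-slot vector."""
    n = adj.shape[0]
    mask = ~np.eye(n, dtype=bool)   # drop self-loops
    return adj[mask]                # length n * (n - 1), slots aligned

def linear_kernel(adj_a: np.ndarray, adj_b: np.ndarray) -> float:
    # Because slots are aligned across graphs, the same edge in two
    # graphs is always compared to itself: a localized comparison.
    return float(edge_slot_features(adj_a) @ edge_slot_features(adj_b))

# Two tiny 2-component patch-effect graphs (toy values).
g1 = np.array([[0.0, 0.8],
               [0.1, 0.0]])
g2 = np.array([[0.0, 0.7],
               [0.0, 0.0]])
print(round(linear_kernel(g1, g2), 2))  # 0.8*0.7 + 0.1*0.0 -> 0.56
```

Such aligned edge-wise vectors can be fed to a standard kernel classifier, which is one way localized features could outperform global graph-kernel comparisons on circuit classification.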

Entities

Institutions

  • arXiv
