ARTFEED — Contemporary Art Intelligence

Unpack: A New Method for Transformer Mechanistic Interpretability

publication · 2026-05-25

Researchers have introduced Unpack, a mechanistic interpretability method that uses backward recursion to decompose credit through a transformer's attention and MLP sublayers. It yields interaction strengths between any two components and per-token attribution from a single forward pass, with no intervention, gradients, or auxiliary training. The method exploits the key-value template φ(S)U, which attention and MLP layers share. On the indirect object identification (IOI) task with GPT-2 small, Unpack recovers all three composition connections described by Wang et al. (2023), together with their mode-specific routing labels (K, Q, or V). It also demonstrates token-level attribution by analyzing two occurrences of the same name within a single decomposition. The work appears as arXiv preprint 2605.23393.
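To make the φ(S)U template concrete, here is a minimal Python sketch of how an attention head and an MLP can both be written as an activation φ applied to scores S, then multiplied by a value matrix U. It illustrates the template only and is not the Unpack recursion from the preprint. The function names, the toy numbers, the omission of biases, and the exact (erf-based) GELU are all assumptions made for this example.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gelu(x):
    # Exact GELU; an assumption here, chosen for simplicity.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def slot_contributions(weights, values):
    # phi(S) U is a weighted sum of the rows of U, so each row (slot)
    # has its own additive contribution: weights[j] * values[j].
    return [[w * v for v in row] for w, row in zip(weights, values)]

def sum_vectors(vectors):
    # Add the per-slot contributions back up into one output vector.
    return [sum(col) for col in zip(*vectors)]

def attention_head(query, keys, values):
    # S = scaled query-key scores, phi = softmax, U = value vectors
    # (one slot per source token).
    scale = math.sqrt(len(query))
    scores = [dot(query, k) / scale for k in keys]
    return slot_contributions(softmax(scores), values)

def mlp(x, w_in, w_out):
    # S = pre-activations, phi = GELU, U = rows of the output matrix
    # (one slot per hidden neuron). Biases are omitted in this sketch.
    scores = [dot(x, k) for k in w_in]
    return slot_contributions([gelu(s) for s in scores], w_out)

# Toy attention head: one query attending over three source tokens.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
per_token = attention_head(query, keys, values)
head_output = sum_vectors(per_token)

# Toy MLP: two hidden neurons.
x = [1.0, -1.0]
w_in = [[1.0, 0.0], [0.0, 1.0]]
w_out = [[2.0, 0.0], [0.0, 3.0]]
per_neuron = mlp(x, w_in, w_out)
mlp_output = sum_vectors(per_neuron)
```

Because φ(S)U is a weighted sum over the rows of U, the contribution of each source token (for attention) or hidden neuron (for the MLP) can be read directly from the forward-pass quantities. That is the sense in which a decomposition of this form needs no gradients or interventions.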

Key facts

  • Unpack is a backward recursion method for transformer interpretability.
  • It decomposes credit through attention and MLP sublayers.
  • It produces interaction strengths between any two components.
  • It generates per-token attribution from a single forward pass.
  • It requires no intervention, gradients, or auxiliary training.
  • It was evaluated on the indirect object identification task with GPT-2 small.
  • It recovers all three composition connections from Wang et al. (2023).
  • Its output includes mode-specific routing labels (K, Q, V).

Entities

Institutions

  • arXiv