Unpack: A New Method for Transformer Mechanistic Interpretability

publication · 2026-05-25

Researchers have introduced Unpack, a backward recursion that decomposes credit through the attention and MLP sublayers of a transformer. It yields interaction strengths between any two components, along with per-token attribution, from a single forward pass, and it requires no intervention, gradients, or auxiliary training. The method exploits the key-value template φ(S)U that attention and MLP sublayers share. On the indirect object identification (IOI) task with GPT-2 small, Unpack recovers all three composition connections described by Wang et al. (2023), including their mode-specific routing labels (K, Q, or V). It also demonstrates token-level attribution by analyzing two occurrences of the same name within a single decomposition. The work is available as arXiv preprint 2605.23393.
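To make the shared template concrete, the sketch below writes one attention head and one MLP sublayer in the same φ(S)U form. It is an illustration under stated assumptions, not the paper's code: the helper name phi_s_u, the toy weights, and the omission of biases, score scaling, and the attention output projection are all choices made here. In attention, S holds the query-key scores over source tokens, φ is softmax, and the rows of U are value vectors. In the MLP, S holds the hidden pre-activations, φ is an elementwise nonlinearity (a tanh approximation of GELU is assumed), and the rows of U are the output weights.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gelu(x):
    """Tanh approximation of GELU (assumed activation for this sketch)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def phi_s_u(scores, phi, values):
    """Hypothetical helper: return phi(S) U, a weighted sum of the rows of U."""
    weights = phi(scores)
    dim = len(values[0])
    return [sum(w * row[j] for w, row in zip(weights, values)) for j in range(dim)]


# Attention head at one query position: the slots are source tokens.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
value_rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
attn_scores = [dot(query, k) for k in keys]          # S
attn_out = phi_s_u(attn_scores, softmax, value_rows)  # softmax(S) U

# MLP at one position: the slots are hidden neurons.
x = [0.5, -1.0]
w_in = [[1.0, 0.0, -1.0], [0.5, 2.0, 1.0]]     # shape (d_model, d_hidden)
w_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # shape (d_hidden, d_model), rows of U
pre_acts = [sum(x[i] * w_in[i][h] for i in range(len(x))) for h in range(3)]  # S
mlp_out = phi_s_u(pre_acts, lambda s: [gelu(v) for v in s], w_out)            # gelu(S) U

print(attn_out)
print(mlp_out)
```

Both sublayers go through the same phi_s_u call and differ only in what the slots are and which φ is applied, which is the structural point the template captures.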

Key facts

Unpack is a backward recursion method for transformer interpretability.
It decomposes credit through attention and MLP sublayers.
It produces interaction strengths between any two components.
It generates per-token attribution from a single forward pass.
It requires no intervention, gradients, or auxiliary training.
It was evaluated on the indirect object identification task with GPT-2 small.
It recovers all three composition connections from Wang et al. (2023).
The recovered connections include mode-specific routing labels (K, Q, V).
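
As a rough illustration of how per-token credit can be read off a single forward pass without gradients, the sketch below splits a scalar readout of one φ(S)U output into exact per-slot contributions. This shows only the single linear step d · (φ(S)U) = Σ_i φ(S)_i (U_i · d); it is not the paper's backward recursion, and how Unpack propagates credit further back through S and across layers is not described in this summary. The token strings, scores, value vectors, and readout direction are hypothetical toy values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def slot_credits(weights, values, direction):
    """Credit of slot i is phi(S)_i * (U_i . direction); credits sum to the readout."""
    return [
        w * sum(u * d for u, d in zip(row, direction))
        for w, row in zip(weights, values)
    ]


# Hypothetical source tokens, including two occurrences of the same name.
tokens = ["Mary", "John", "Mary"]
scores = [2.0, 0.5, 1.0]                        # S: toy attention scores
values = [[1.0, 0.0], [0.0, 1.0], [0.2, 0.1]]   # rows of U: toy value vectors
direction = [1.0, -1.0]                         # toy readout direction d

weights = softmax(scores)                        # phi(S), taken from the forward pass
credits = slot_credits(weights, values, direction)

# The head output and its scalar readout along d.
output = [sum(w * row[j] for w, row in zip(weights, values)) for j in range(2)]
readout = sum(o * d for o, d in zip(output, direction))

for tok, c in zip(tokens, credits):
    print(f"{tok}: {c:+.4f}")
print(f"sum of credits = {sum(credits):+.4f}, readout = {readout:+.4f}")
```

Because each position gets its own slot, the two occurrences of the same name receive separate credits, and the credits add up to the readout without any intervention or gradient computation.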
