Unpack: A New Method for Transformer Mechanistic Interpretability
Researchers have introduced Unpack, a backward recursion method that decomposes credit through the attention and MLP sublayers of a transformer. It yields interaction strengths between any two components, along with per-token attribution, from a single forward pass, with no intervention, gradients, or auxiliary training. The method relies on the key-value template φ(S)U, which attention and MLP layers share. On the indirect object identification (IOI) task with GPT-2 small, Unpack recovers all three composition connections described by Wang et al. (2023), including their mode-specific routing labels (K, Q, or V). It also demonstrates token-level attribution by analyzing two occurrences of the same name within a single decomposition. The work is available as arXiv preprint 2605.23393.
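To make the φ(S)U template concrete, here is a minimal NumPy sketch of one plausible reading of it: S is a vector of scores, φ is a nonlinearity, and U is a matrix of value rows. This is not the Unpack algorithm, and the names `key_value_readout`, `attn_parts`, and `mlp_parts` are hypothetical. The sketch only shows that an attention head and an MLP can both be written as a weighted sum of value rows, which is why a per-slot split of the output is available from the forward pass alone. ReLU stands in for the MLP activation to keep the example short; GPT-2 itself uses GELU.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def key_value_readout(scores, values, phi):
    """Compute phi(S) U and the per-slot contributions that sum to it.

    scores: (n_slots,) raw scores S
    values: (n_slots, d) value rows U
    phi:    elementwise or normalizing nonlinearity applied to S
    """
    weights = phi(scores)                        # (n_slots,)
    contributions = weights[:, None] * values    # (n_slots, d)
    return contributions.sum(axis=0), contributions

rng = np.random.default_rng(0)
d_model, n_tokens, d_hidden = 4, 3, 8

# Attention head at one query position: the slots are source tokens,
# S = K q / sqrt(d), phi = softmax, U = V.
q = rng.normal(size=d_model)
K = rng.normal(size=(n_tokens, d_model))
V = rng.normal(size=(n_tokens, d_model))
attn_out, attn_parts = key_value_readout(K @ q / np.sqrt(d_model), V, softmax)

# MLP at one position: the slots are hidden neurons,
# S = x W_in, phi = ReLU (assumption, see lead-in), U = W_out.
x = rng.normal(size=d_model)
W_in = rng.normal(size=(d_model, d_hidden))
W_out = rng.normal(size=(d_hidden, d_model))
relu = lambda s: np.maximum(s, 0.0)
mlp_out, mlp_parts = key_value_readout(x @ W_in, W_out, relu)
```

Each row of `attn_parts` is the share of the head's output carried by one source token, and each row of `mlp_parts` is the share carried by one hidden neuron. In both cases the rows sum exactly to the sublayer output.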
Key facts
- Unpack is a backward recursion method for transformer interpretability.
- It decomposes credit through attention and MLP sublayers.
- It produces interaction strengths between any two components.
- It generates per-token attribution from a single forward pass.
- It requires no intervention, gradients, or auxiliary training.
- It was evaluated on the indirect object identification (IOI) task with GPT-2 small.
- It recovers all three composition connections from Wang et al. (2023).
- The recovered connections include mode-specific routing labels (K, Q, or V).
Entities
Institutions
- arXiv