Five-Stage Methodology for Causal Feature Analysis in Transformer Language Models

ai-technology · 2026-05-23

A study on arXiv (2605.22462) introduces a five-stage methodology for analyzing causal features in transformer language models. The stages are probe design, feature extraction, causal validation, robustness testing, and deployment integration. The methodology is demonstrated on GPT-2 small performing the Indirect Object Identification (IOI) task, where activation patching recovers the canonical IOI circuit, with layer-9 head 9 giving a recovery of +1.02. A sparse autoencoder (SAE) recovers per-name selective features with effect sizes of 30 to 50 activation units. Causal validation shows that these features are only partially causal: ablating fifteen of them still leaves the model accurate on 98% of prompts. Two NLA-inspired evaluations show that the fifteen features explain only 31% of activation variance, versus 99.7% for the SAE, and that selectivity ratio anticorrelates with causal force (r = -0.56). Robustness tests across three distribution shifts show that the circuit remains effective.
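To make the recovery figure concrete, the sketch below computes a normalized activation-patching score. It assumes the common convention in which the patched run's logit difference is rescaled between the corrupted run (0) and the clean run (1); the summary does not say which normalization the paper uses. The function name `patching_recovery` and the input values are hypothetical, chosen only so the result lands near the reported +1.02.

```python
def patching_recovery(clean: float, corrupted: float, patched: float) -> float:
    """Normalized recovery: 0.0 matches the corrupted run, 1.0 matches the clean run.

    Assumed convention (not confirmed by the source):
        (patched - corrupted) / (clean - corrupted)
    """
    gap = clean - corrupted
    if gap == 0:
        raise ValueError("clean and corrupted metrics are equal; recovery is undefined")
    return (patched - corrupted) / gap


# Hypothetical logit differences (indirect-object name minus subject name).
# These are illustrative values, not numbers from the paper.
clean_diff = 3.00       # model run on the original IOI prompt
corrupted_diff = -1.00  # model run on the corrupted prompt
patched_diff = 3.08     # corrupted run with one head's clean activation patched in

recovery = patching_recovery(clean_diff, corrupted_diff, patched_diff)
print(f"recovery = {recovery:+.2f}")
```

Under this convention a score slightly above 1 means that patching the single head restores a little more than the full clean-run logit difference.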

Key facts

arXiv paper 2605.22462 proposes a five-stage methodology for causal feature analysis in transformer language models.
Methodology includes probe design, feature extraction, causal validation, robustness testing, and deployment integration.
Demonstrated on GPT-2 small performing the Indirect Object Identification (IOI) task.
Activation patching recovers the canonical IOI circuit, with layer-9 head 9 giving a recovery of +1.02.
Sparse autoencoder recovers per-name selective features with effect sizes of 30 to 50 activation units.
Ablating fifteen selective features leaves the model accurate on 98% of prompts.
The fifteen selective features explain only 31% of activation variance, versus 99.7% for the SAE.
Selectivity ratio anticorrelates with causal force (r = -0.56).
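The last two facts rest on two standard statistics: the fraction of activation variance explained by a reconstruction, and the Pearson correlation between two per-feature scores. The sketch below shows one way to compute both with the Python standard library. The helper names and the assumption that variance is measured around the per-dimension mean are illustrative; the summary does not specify the paper's exact definitions.

```python
from math import sqrt


def fraction_variance_explained(activations, reconstructions):
    """1 - (residual sum of squares / total sum of squares around the per-dimension mean).

    Both arguments are lists of equal-length vectors (one vector per sample).
    """
    n = len(activations)
    dims = len(activations[0])
    means = [sum(row[d] for row in activations) / n for d in range(dims)]
    total = sum((row[d] - means[d]) ** 2 for row in activations for d in range(dims))
    residual = sum(
        (a[d] - r[d]) ** 2
        for a, r in zip(activations, reconstructions)
        for d in range(dims)
    )
    if total == 0:
        raise ValueError("activations have zero variance")
    return 1.0 - residual / total


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

In this framing, a reconstruction built from only the fifteen selective features would be passed as `reconstructions` to obtain a figure comparable to the 31%, and `pearson_r` would be applied to per-feature selectivity ratios and causal-effect scores to obtain a value comparable to r = -0.56.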
