ARTFEED — Contemporary Art Intelligence

Data-Driven Circuit Discovery Challenges Assumptions in Language Model Interpretability

publication · 2026-05-12

A recent investigation rigorously tests the foundational assumptions behind circuit discovery techniques for language models (LMs). Circuit discovery aims to identify and interpret the computational subgraphs responsible for specific tasks. Current approaches are hypothesis-driven: they operationalize a task with a dataset and apply a discovery algorithm that returns a single circuit. This rests on two premises — that the LM uses one circuit per task, and that the dataset adequately represents the task. The researchers examined these assumptions across four previously studied tasks. They found that slight dataset variations that preserve task semantics yield circuits with minimal edge overlap and low cross-dataset faithfulness. Even when evaluated on a combined dataset mixing two distinct tasks, existing methods still returned a single circuit, exposing their inability to handle task multiplicity. The findings suggest that circuit discovery should adopt data-driven strategies that allow multiple circuits per task and account for dataset biases.
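Edge overlap between two discovered circuits is typically quantified with a set-overlap measure such as intersection-over-union over their edge sets. A minimal sketch of that comparison — the edge names below are hypothetical placeholders, not circuits from the study:

```python
def edge_iou(circuit_a, circuit_b):
    """Intersection-over-union of two circuits' edge sets (1.0 = identical)."""
    a, b = set(circuit_a), set(circuit_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical circuits discovered on two paraphrased variants of one task.
variant_1 = {("attn_0.1", "attn_9.6"), ("attn_9.6", "mlp_10"), ("mlp_10", "logits")}
variant_2 = {("attn_0.5", "attn_9.6"), ("attn_9.6", "mlp_10"), ("mlp_11", "logits")}

print(edge_iou(variant_1, variant_2))  # low overlap despite identical task semantics
```

The study's observation is that scores like this stay low even when the two datasets instantiate the same task, which is what undermines the single-circuit premise.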

Key facts

  • Circuit discovery aims to explain LM behavior by localizing computational subgraphs.
  • Existing methods assume a single circuit per task and dataset adequacy.
  • Four previously studied tasks were tested.
  • Minor dataset variations produce circuits with low edge overlap.
  • Cross-dataset faithfulness is low under dataset variations.
  • Mixed datasets with two distinct tasks yield circuits with near-zero cross-faithfulness.
  • Existing methods still produce a single circuit for mixed tasks.
  • The study calls for data-driven circuit discovery methods.
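Cross-dataset faithfulness, as referenced above, asks how much of the full model's task performance a circuit recovers when it is evaluated on a dataset other than the one it was discovered on. A common normalization (a sketch under assumed conventions; the scores below are illustrative, not the study's numbers) is:

```python
def faithfulness(full_score, circuit_score, corrupted_score):
    """Fraction of full-model performance recovered by the circuit,
    normalized between a fully corrupted baseline (0.0) and the full model (1.0)."""
    denom = full_score - corrupted_score
    if denom == 0:
        return 0.0
    return (circuit_score - corrupted_score) / denom

# Hypothetical logit-difference scores for a circuit discovered on dataset A.
print(faithfulness(full_score=3.2, circuit_score=3.0, corrupted_score=0.1))  # evaluated on A: high
print(faithfulness(full_score=3.2, circuit_score=0.4, corrupted_score=0.1))  # evaluated on B: low
```

Near-zero values on the mixed two-task dataset indicate that the single circuit returned by existing methods explains essentially none of the behavior on the other task.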

Entities

Institutions

  • arXiv
