ARTFEED — Contemporary Art Intelligence

AI Agents Automate Interpretability of Large Language Models

ai-technology · 2026-05-06

Researchers have developed an autonomous multi-agent framework for mechanistic interpretability that automates both the explanation and the discovery of internal features in large language models. The system runs two coupled loops. In the first, explanation refinement, an agent proposes competing hypotheses about what a feature encodes and tests them iteratively with targeted prompt controls and a multi-metric evaluation. In the second, feature discovery, the agent generates prompt sets, builds a k-nearest-neighbor graph in activation space, and surfaces candidate features scored on statistical separability and semantic coherence. In experiments on Gemma-2 family models and on MLP neurons in weight-sparse transformers, the agent outperforms one-shot auto-interpretations, uncovers language-specific and safety-relevant features, and produces auditable explanation traces. The work shows that agent-driven empirical loops yield more precise and testable explanations than one-shot labels.
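A minimal Python sketch of the refinement loop follows, under stated assumptions: `propose_hypotheses` stands in for the LLM agent that drafts competing explanations, `feature_activation` for a hook that reads the target feature's activation on a prompt, and a single score combining two illustrative metrics stands in for the framework's full multi-metric evaluation. None of these names come from the published work.

    import statistics
    from dataclasses import dataclass, field

    @dataclass
    class Hypothesis:
        text: str                                       # candidate explanation
        positives: list = field(default_factory=list)   # prompts predicted to fire
        controls: list = field(default_factory=list)    # matched negative controls
        score: float = 0.0

    def score_hypothesis(hyp, feature_activation):
        """Illustrative multi-metric score: activation gap times hit rate."""
        pos = [feature_activation(p) for p in hyp.positives]
        neg = [feature_activation(p) for p in hyp.controls]
        gap = statistics.mean(pos) - statistics.mean(neg)  # separation from controls
        hit_rate = sum(a > 0 for a in pos) / len(pos)      # fraction of positives firing
        return gap * hit_rate

    def refine(feature_activation, propose_hypotheses, rounds=5):
        """Iteratively propose competing hypotheses; keep the best-scoring one."""
        best = None
        for _ in range(rounds):
            # The agent drafts new candidates, conditioned on the best so far.
            for hyp in propose_hypotheses(best):
                hyp.score = score_hypothesis(hyp, feature_activation)
                if best is None or hyp.score > best.score:
                    best = hyp
        return best

Because each accepted hypothesis carries its prompts and score, a loop of this shape naturally leaves behind the kind of auditable explanation trace the article describes.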

Key facts

  • The framework automates both explaining and finding internal features in large language models.
  • It uses two coupled loops: explanation refinement and feature discovery.
  • Explanation refinement involves an agent proposing competing hypotheses and testing them with prompt controls and multi-metric evaluation.
  • Feature discovery uses prompt sets, k-nearest-neighbor graphs in activation space, and statistical-separability criteria (see the sketch after this list).
  • The system was tested on Gemma-2 family models and MLP neurons in weight-sparse transformers.
  • It outperforms one-shot auto-interpretations.
  • It discovers language-specific and safety-relevant features.
  • It produces auditable explanation traces.
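As referenced above, a companion sketch of feature discovery follows, again with hypothetical names: `get_activations` maps a prompt to an activation vector for the layer under study, and a simple between-to-within spread ratio stands in for the combined statistical-separability and semantic-coherence criteria. This is a sketch of the technique, not the authors' implementation.

    import numpy as np

    def knn_graph(X, k=10):
        """Symmetric k-nearest-neighbor adjacency over activation vectors X."""
        n = len(X)
        k = min(k, n - 1)
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)                 # exclude self-matches
        nbrs = np.argsort(d, axis=1)[:, :k]         # k closest prompts per row
        A = np.zeros((n, n), dtype=bool)
        A[np.repeat(np.arange(n), k), nbrs.ravel()] = True
        return A | A.T                              # symmetrize

    def connected_components(A):
        """Label connected components of the graph by depth-first search."""
        labels = -np.ones(len(A), dtype=int)
        current = 0
        for start in range(len(A)):
            if labels[start] >= 0:
                continue
            stack = [start]
            labels[start] = current
            while stack:
                u = stack.pop()
                for v in np.flatnonzero(A[u]):
                    if labels[v] < 0:
                        labels[v] = current
                        stack.append(v)
            current += 1
        return labels

    def separability(X, labels, c):
        """Between-cluster vs. within-cluster spread for component c."""
        inside, outside = X[labels == c], X[labels != c]
        if len(inside) < 2 or len(outside) == 0:
            return 0.0
        within = inside.std(axis=0).mean() + 1e-9   # avoid divide-by-zero
        between = np.linalg.norm(inside.mean(axis=0) - outside.mean(axis=0))
        return between / within

    def discover_features(prompts, get_activations, k=10, min_sep=2.0):
        """Cluster prompts in activation space; keep well-separated clusters
        as candidate features for the refinement loop to explain."""
        X = np.asarray([get_activations(p) for p in prompts])
        labels = connected_components(knn_graph(X, k))
        return [(c, [p for p, l in zip(prompts, labels) if l == c])
                for c in range(labels.max() + 1)
                if separability(X, labels, c) >= min_sep]

Clusters that pass the separability threshold would then feed the refinement loop sketched earlier; semantic coherence, which this sketch omits, would typically be judged by the agent itself.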
