Graph-Based Analysis of Sparse Autoencoder Features via WL Kernel
A recent study introduces a graph-based method for examining sparse autoencoder (SAE) features, moving beyond flat top-token lists to capture structured co-occurrence patterns. In this framework, each SAE feature is represented as a token co-occurrence graph: nodes are tokens that frequently appear near strong activations, and edges connect tokens that co-occur within a local context window. Similarity between feature graphs is measured with a custom Weisfeiler-Lehman-style, frequency-binned graph kernel. The approach is demonstrated as a proof of concept on features from a large SAE trained on GPT-2 Small, analyzed over a synthetic mixed-domain corpus, where clustering recovers heuristic motif families such as punctuation-heavy patterns and language-specific groups. The paper is available on arXiv under ID 2605.06494.
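To make the graph construction concrete, the following is a minimal sketch of how such a token co-occurrence graph might be built for one feature. It assumes per-token feature activations are already available; the function name, threshold, window size, and count cutoff are illustrative assumptions, not values taken from the paper.

```python
# Hypothetical sketch: build a token co-occurrence graph for one SAE feature.
# `tokens` is the tokenized corpus and `activations` holds the feature's
# activation at each position; all parameter values here are assumptions.
from collections import Counter
from itertools import combinations

import networkx as nx


def build_cooccurrence_graph(tokens, activations, act_threshold=1.0,
                             window=5, min_token_count=3):
    """Nodes: tokens frequent near strong activations.
    Edges: tokens that co-occur within a local context window."""
    # Collect context windows around positions where the feature fires strongly.
    windows = []
    for i, act in enumerate(activations):
        if act >= act_threshold:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            windows.append(tokens[lo:hi])

    # Keep only tokens that recur across many high-activation contexts.
    counts = Counter(t for w in windows for t in set(w))
    keep = {t for t, c in counts.items() if c >= min_token_count}

    g = nx.Graph()
    g.add_nodes_from(keep)
    for w in windows:
        present = sorted(set(w) & keep)
        for u, v in combinations(present, 2):
            # Edge weight counts how often the pair co-occurs in a window.
            weight = g[u][v]["weight"] + 1 if g.has_edge(u, v) else 1
            g.add_edge(u, v, weight=weight)
    return g
```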
Key facts
- Sparse autoencoders (SAEs) decompose transformer activations into monosemantic features.
- Existing analyses rely on top-activating token lists or decoder weight vectors.
- The paper models each SAE feature as a token co-occurrence graph.
- Nodes are tokens that frequently appear near strong activations; edges connect tokens that co-occur within a local context window.
- A custom WL-style frequency-binned graph kernel measures similarity between feature graphs (see the sketch after this list).
- Proof of concept uses a large SAE trained on GPT-2 Small.
- The corpus is synthetic and mixed-domain.
- Clustering recovers heuristic motif families like punctuation-heavy patterns.
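The kernel itself is not specified in detail here, but a plausible reading is standard Weisfeiler-Lehman relabeling with initial node labels drawn from coarse token-frequency bins. The sketch below follows that reading; the binning edges, iteration count, and function names are assumptions rather than the paper's actual construction.

```python
# Hypothetical sketch of a WL-style kernel with frequency-binned initial
# labels: each node starts from a coarse bin of its token frequency, then
# standard Weisfeiler-Lehman relabeling iterations refine the labels.
from collections import Counter

import networkx as nx


def freq_bin(count, edges=(1, 5, 20, 100)):
    """Map a raw token count onto a small set of frequency bins (assumed scheme)."""
    return sum(count > e for e in edges)


def wl_histograms(g, token_counts, iterations=3):
    """Return the node-label histogram after each WL iteration for one graph."""
    labels = {n: str(freq_bin(token_counts.get(n, 0))) for n in g.nodes}
    hists = [Counter(labels.values())]
    for _ in range(iterations):
        # WL step: extend each label with the sorted labels of its neighbors.
        labels = {
            n: labels[n] + "|" + ",".join(sorted(labels[m] for m in g.neighbors(n)))
            for n in g.nodes
        }
        hists.append(Counter(labels.values()))
    return hists


def wl_kernel(g1, g2, counts1, counts2, iterations=3):
    """Similarity = summed dot products of per-iteration label histograms."""
    h1 = wl_histograms(g1, counts1, iterations)
    h2 = wl_histograms(g2, counts2, iterations)
    return float(sum(
        sum(a[label] * b[label] for label in a.keys() & b.keys())
        for a, b in zip(h1, h2)
    ))
```

Pairwise kernel values over all feature graphs yield a Gram matrix, which could then be handed to an off-the-shelf method such as scikit-learn's SpectralClustering with affinity="precomputed" to recover the motif families described above.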
Entities
Institutions
- arXiv