Graph-Based Analysis of Sparse Autoencoder Features via WL Kernel
A recent study introduces a graph-based method for examining sparse autoencoder (SAE) features, moving beyond flat top-token lists to capture structured co-occurrence patterns. In this framework, each SAE feature is represented as a token co-occurrence graph: nodes are tokens that frequently appear near strong activations, and edges connect tokens that co-occur within a local context window. Similarity between feature graphs is measured with a custom Weisfeiler-Lehman-style, frequency-binned graph kernel. The approach is demonstrated as a proof of concept on features from a large SAE trained on GPT-2 Small, analyzed over a synthetic mixed-domain corpus, where clustering recovers heuristic motif families such as punctuation-heavy patterns and language-specific groups. The paper is available on arXiv under ID 2605.06494.
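To make the graph construction concrete, the following is a minimal sketch of how such a token co-occurrence graph might be built for one feature. It assumes per-token feature activations are already available; the function name, threshold, window size, and count cutoff are illustrative assumptions, not values taken from the paper.

```python
# Hypothetical sketch: build a token co-occurrence graph for one SAE feature.
# `tokens` is the tokenized corpus and `activations` holds the feature's
# activation at each position; all parameter values here are assumptions.
from collections import Counter
from itertools import combinations

import networkx as nx


def build_cooccurrence_graph(tokens, activations, act_threshold=1.0,
                             window=5, min_token_count=3):
    """Nodes: tokens frequent near strong activations.
    Edges: tokens that co-occur within a local context window."""
    # Collect context windows around positions where the feature fires strongly.
    windows = []
    for i, act in enumerate(activations):
        if act >= act_threshold:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            windows.append(tokens[lo:hi])

    # Keep only tokens that recur across many high-activation contexts.
    counts = Counter(t for w in windows for t in set(w))
    keep = {t for t, c in counts.items() if c >= min_token_count}

    g = nx.Graph()
    g.add_nodes_from(keep)
    for w in windows:
        present = sorted(set(w) & keep)
        for u, v in combinations(present, 2):
            # Edge weight counts how often the pair co-occurs in a window.
            weight = g[u][v]["weight"] + 1 if g.has_edge(u, v) else 1
            g.add_edge(u, v, weight=weight)
    return g
```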
Key facts
- Sparse autoencoders (SAEs) decompose transformer activations into monosemantic features.
- Existing analyses rely on top-activating token lists or decoder weight vectors.
- The paper models each SAE feature as a token co-occurrence graph.
- Nodes are tokens that frequently appear near strong activations; edges connect tokens that co-occur within a local context window.
- A custom WL-style frequency-binned graph kernel measures similarity between feature graphs (see the sketch after this list).
- Proof of concept uses a large SAE trained on GPT-2 Small.
- The corpus is synthetic and mixed-domain.
- Clustering recovers heuristic motif families like punctuation-heavy patterns.
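The kernel itself is not specified in detail here, but a plausible reading is standard Weisfeiler-Lehman relabeling with initial node labels drawn from coarse token-frequency bins. The sketch below follows that reading; the binning edges, iteration count, and function names are assumptions rather than the paper's actual construction.

```python
# Hypothetical sketch of a WL-style kernel with frequency-binned initial
# labels: each node starts from a coarse bin of its token frequency, then
# standard Weisfeiler-Lehman relabeling iterations refine the labels.
from collections import Counter

import networkx as nx


def freq_bin(count, edges=(1, 5, 20, 100)):
    """Map a raw token count onto a small set of frequency bins (assumed scheme)."""
    return sum(count > e for e in edges)


def wl_histograms(g, token_counts, iterations=3):
    """Return the node-label histogram after each WL iteration for one graph."""
    labels = {n: str(freq_bin(token_counts.get(n, 0))) for n in g.nodes}
    hists = [Counter(labels.values())]
    for _ in range(iterations):
        # WL step: extend each label with the sorted labels of its neighbors.
        labels = {
            n: labels[n] + "|" + ",".join(sorted(labels[m] for m in g.neighbors(n)))
            for n in g.nodes
        }
        hists.append(Counter(labels.values()))
    return hists


def wl_kernel(g1, g2, counts1, counts2, iterations=3):
    """Similarity = summed dot products of per-iteration label histograms."""
    h1 = wl_histograms(g1, counts1, iterations)
    h2 = wl_histograms(g2, counts2, iterations)
    return float(sum(
        sum(a[label] * b[label] for label in a.keys() & b.keys())
        for a, b in zip(h1, h2)
    ))
```

Pairwise kernel values over all feature graphs yield a Gram matrix, which could then be handed to an off-the-shelf method such as scikit-learn's SpectralClustering with affinity="precomputed" to recover the motif families described above.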
Entities
Institutions
- arXiv