ARTFEED — Contemporary Art Intelligence

Graph-Based Analysis of Sparse Autoencoder Features via WL Kernel

other · 2026-05-09

A recent study presents a graph-based method for examining features from sparse autoencoders (SAEs), moving beyond simple top-activating token lists to capture richer co-occurrence structure. Each SAE feature is represented as a token co-occurrence graph: nodes are tokens that frequently occur near strong activations, and edges connect tokens that co-occur within local context windows. Similarity between these graphs is measured with a custom Weisfeiler-Lehman-style frequency-binned graph kernel. As a proof of concept, the method is applied to features from a large SAE trained on GPT-2 Small, analyzed on a synthetic mixed-domain corpus; clustering successfully recovers heuristic motif families such as punctuation-rich patterns and language-specific groups. The paper is available on arXiv under ID 2605.06494.
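The graph construction described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the input format (a list of token windows around strong activations) and the parameters `top_k` and `window` are assumptions.

```python
from collections import Counter


def cooccurrence_graph(contexts, top_k=10, window=3):
    """Build a token co-occurrence graph for one SAE feature.

    contexts: list of token lists, each a context window around a
    strong activation of the feature (hypothetical input format).
    Nodes are the top_k most frequent tokens across contexts;
    edges connect node tokens co-occurring within `window` positions.
    """
    freq = Counter(tok for ctx in contexts for tok in ctx)
    nodes = {tok for tok, _ in freq.most_common(top_k)}
    edges = Counter()
    for ctx in contexts:
        for i, a in enumerate(ctx):
            if a not in nodes:
                continue
            # Look ahead within the local window for co-occurring nodes.
            for b in ctx[i + 1 : i + 1 + window]:
                if b in nodes and b != a:
                    edges[frozenset((a, b))] += 1
    return nodes, dict(edges)
```

Edge weights here count how often two node tokens appear near each other, which is one plausible way to realize the "edges link co-occurring tokens within localized context windows" description.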

Key facts

  • Sparse autoencoders (SAEs) decompose transformer activations into monosemantic features.
  • Existing analyses rely on top-activating token lists or decoder weight vectors.
  • The paper models each SAE feature as a token co-occurrence graph.
  • Nodes are tokens frequent near strong activations; edges connect co-occurring tokens.
  • A custom WL-style frequency-binned graph kernel measures similarity.
  • Proof of concept uses a large SAE trained on GPT-2 Small.
  • The corpus is synthetic and mixed-domain.
  • Clustering recovers heuristic motif families like punctuation-heavy patterns.
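A WL-style frequency-binned kernel of the kind the paper names could be sketched as follows. This is a generic Weisfeiler-Lehman subtree kernel with initial node labels taken from log-frequency bins, a plausible reading of "frequency-binned"; the binning scheme, graph representation, and parameters are assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def wl_kernel(graph_a, graph_b, iterations=2, n_bins=4):
    """WL-style kernel between two token co-occurrence graphs.

    Each graph is (node_freq, adjacency): node_freq maps token -> count
    near strong activations; adjacency maps token -> set of neighbours.
    Initial labels are log2-frequency bins (assumed binning scheme).
    """
    def initial_labels(node_freq):
        return {v: min(int(math.log2(c + 1)), n_bins - 1)
                for v, c in node_freq.items()}

    def refine(labels, adj):
        # Standard WL step: new label = (own label, sorted neighbour labels).
        return {v: (labels[v],
                    tuple(sorted(labels[u] for u in adj.get(v, ()))))
                for v in labels}

    def histograms(node_freq, adj):
        labels = initial_labels(node_freq)
        hists = [Counter(labels.values())]
        for _ in range(iterations):
            labels = refine(labels, adj)
            hists.append(Counter(labels.values()))
        return hists

    ha, hb = histograms(*graph_a), histograms(*graph_b)
    # Kernel value: summed dot products of label histograms per round.
    return sum(sum(h1[l] * h2[l] for l in h1) for h1, h2 in zip(ha, hb))
```

Because refined labels encode neighbourhood structure, two features score highly when their graphs share both token-frequency profiles and local wiring, which is what lets clustering over kernel values group motif families together.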

Entities

Institutions

  • arXiv

Sources