Knowledge Graphs from Sparse Autoencoder Features
A new method extracts domain-specific knowledge graphs from sparse autoencoder features in language models. The approach first narrows millions of features to a domain-specific concept set using contrastive activations and multi-stage filtering, then builds two aligned graphs over that set (a co-occurrence graph and a transcoder-based mechanism graph) and labels their edges automatically. A case study on a biology textbook demonstrates the technique.
Key facts
- Sparse autoencoders extract millions of interpretable features from language models.
- Domain concepts are mixed with generic and weakly grounded features.
- Contrastive activations and multi-stage filtering construct a domain-specific concept universe.
- Two aligned graph views are built: a co-occurrence graph and a transcoder-based mechanism graph.
- Automated edge labeling turns graph views into readable knowledge graphs.
- A case study was conducted on a biology textbook.
- The method addresses the scattering of related ideas across many units.
- The approach organizes conceptual structure at multiple levels of granularity.
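The filtering and co-occurrence steps listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the per-chunk dictionary format, the feature labels, and the `min_gap` and `min_count` thresholds are all assumptions made for illustration, and the single mean-difference test stands in for the multi-stage filtering the summary mentions.

```python
from collections import Counter
from itertools import combinations


def mean_per_feature(rows):
    """Average each feature's activation over all text chunks (rows).

    A feature missing from a row counts as zero for that row.
    """
    totals = Counter()
    for row in rows:
        for feat, val in row.items():
            totals[feat] += val
    return {feat: total / len(rows) for feat, total in totals.items()}


def contrastive_filter(domain_acts, generic_acts, min_gap=0.2):
    """Keep features that fire more on domain text than on generic text.

    min_gap is a hypothetical threshold on the difference of means.
    """
    dom = mean_per_feature(domain_acts)
    gen = mean_per_feature(generic_acts)
    return {f for f, m in dom.items() if m - gen.get(f, 0.0) >= min_gap}


def cooccurrence_graph(domain_acts, keep, min_count=2):
    """Count how often two kept features are active in the same chunk.

    Returns {(feature_a, feature_b): count} for pairs seen at least
    min_count times (a hypothetical cutoff); pairs are sorted by name.
    """
    counts = Counter()
    for row in domain_acts:
        active = sorted(f for f, v in row.items() if f in keep and v > 0)
        counts.update(combinations(active, 2))
    return {pair: c for pair, c in counts.items() if c >= min_count}


# Toy activations: one dict per text chunk, feature label -> activation.
domain_acts = [
    {"cell_membrane": 0.9, "osmosis": 0.7, "the": 0.5},
    {"cell_membrane": 0.8, "osmosis": 0.6, "the": 0.6},
    {"mitosis": 0.9, "the": 0.5},
]
generic_acts = [
    {"the": 0.6},
    {"the": 0.5},
]

keep = contrastive_filter(domain_acts, generic_acts)
edges = cooccurrence_graph(domain_acts, keep)
```

In this toy data the generic feature `the` is dropped because it fires about equally on both corpora, and the only edge that survives the count cutoff links `cell_membrane` and `osmosis`. The transcoder-based mechanism graph and the automated edge labeling are not sketched here, since the summary does not describe them in enough detail.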