Sparse Autoencoders' Ability to Capture Concept Manifolds Questioned

ai-technology · 2026-05-01

A new theoretical framework challenges the assumption that sparse autoencoders (SAEs) capture concepts as independent linear directions, proposing instead that concepts are organized along low-dimensional manifolds. The study, posted on arXiv (2604.28119), identifies two modes of manifold capture: global, where a compact group of atoms spans the entire manifold, and local, where features tile restricted regions of it. Empirical results indicate that SAEs recover continuous structures suboptimally, mixing features in ways that obscure geometric relationships. The research raises fundamental questions about interpretability in neural network representations.

Key facts

Sparse autoencoders are used to extract interpretable features from neural networks.
Concepts may be organized along low-dimensional manifolds, not independent linear directions.
The study develops a theoretical framework for understanding SAE manifold capture.
Two modes are identified: global (a compact group of atoms spans the manifold) and local (features tile restricted regions).
Empirical findings show that SAEs recover continuous structures suboptimally.
The paper is available on arXiv under ID 2604.28119.
The research challenges the implicit assumption of independent linear concept directions.
The work is categorized under artificial intelligence and machine learning.
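
To make the two capture modes concrete, here is a toy sketch in NumPy. It is an illustration only, not the paper's method or experiments: the circular "concept manifold", the hand-built dictionaries, and the function names (circle_in_space, global_capture, local_capture) are all assumptions introduced for this example, and no SAE is trained. A circle is embedded in an 8-dimensional space; the "global" dictionary uses four atoms that span the circle's plane, while the "local" dictionary uses k atoms that each cover one arc, with only the best-matching atom active per point.

```python
import numpy as np

def circle_in_space(n, dim=8, seed=0):
    """Sample n points on a unit circle embedded in a dim-dimensional space.
    The circle stands in for a 1-D concept manifold (an assumption of this sketch)."""
    rng = np.random.default_rng(seed)
    theta = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    # Orthonormal basis (dim, 2) for the plane that contains the circle.
    basis, _ = np.linalg.qr(rng.normal(size=(dim, 2)))
    x = np.cos(theta)[:, None] * basis[:, 0] + np.sin(theta)[:, None] * basis[:, 1]
    return theta, x, basis

def global_capture(x, basis):
    """Global mode: a compact group of atoms (+/- each plane axis) spans the whole manifold."""
    atoms = np.concatenate([basis.T, -basis.T], axis=0)   # (4, dim)
    codes = np.maximum(x @ atoms.T, 0.0)                  # ReLU keeps codes non-negative
    return codes, atoms

def local_capture(x, basis, k=16):
    """Local mode: k atoms tile the circle; each point activates only its nearest atom."""
    phi = np.linspace(0.0, 2.0 * np.pi, k, endpoint=False)
    atoms = np.cos(phi)[:, None] * basis[:, 0] + np.sin(phi)[:, None] * basis[:, 1]  # (k, dim)
    scores = x @ atoms.T                                  # cosine of the angular offset
    winner = scores.argmax(axis=1)
    rows = np.arange(len(x))
    codes = np.zeros_like(scores)
    codes[rows, winner] = scores[rows, winner]            # top-1 sparsity
    return codes, atoms

theta, x, basis = circle_in_space(360)
codes_g, atoms_g = global_capture(x, basis)
codes_l, atoms_l = local_capture(x, basis, k=16)

# Worst-case reconstruction error over all sampled points, per mode.
err_g = np.linalg.norm(codes_g @ atoms_g - x, axis=1).max()
err_l = np.linalg.norm(codes_l @ atoms_l - x, axis=1).max()
print("global: atoms =", len(atoms_g), "max error =", err_g)
print("local:  atoms =", len(atoms_l), "max error =", err_l)
```

In this toy setting the global dictionary reconstructs every point exactly (up to floating-point error) with four atoms, whereas the local dictionary activates a single atom per point and leaves an error of at most sin(pi/k), which shrinks only as more tiles are added. The trade-off between a few shared atoms and many region-specific features is the distinction the summary describes; how trained SAEs actually behave is the paper's empirical question, which this sketch does not address.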
