ARTFEED — Contemporary Art Intelligence

Geometric Framework Reveals Instability in Sparse Autoencoder Feature Steering

ai-technology · 2026-05-09

A recent theoretical study posted to arXiv (2605.05223) examines the structural instability that arises when features are composed in Sparse Autoencoders (SAEs). Although SAEs help disentangle feature superposition in transformers and enable activation steering, the research indicates that activating several semantic latents simultaneously can trigger compositional collapse. The authors model the activation space as a high-dimensional sparse cone manifold and, under a spherical dictionary model, derive an asymptotic collapse threshold characterized by the Gaussian mean width of the signal cone. They further show that ReLU rectification converts correlation-induced variance fluctuations into systematic bias, exacerbating the instability of feature unions. By exposing non-linear interference effects in overcomplete dictionaries, the study challenges the Linear Representation Hypothesis.
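The variance-to-bias mechanism can be illustrated with a toy computation (a hedged sketch, not the paper's derivation): for zero-mean Gaussian fluctuations z ~ N(0, σ²), rectification shifts the mean from 0 to σ/√(2π), so symmetric noise becomes a systematic positive offset.

```python
import numpy as np

# Toy illustration (not the paper's model): zero-mean Gaussian
# fluctuations acquire a positive mean after ReLU rectification.
rng = np.random.default_rng(0)
sigma = 0.5
z = rng.normal(0.0, sigma, size=1_000_000)

relu_z = np.maximum(z, 0.0)

# Closed form: E[ReLU(z)] = sigma / sqrt(2*pi) for z ~ N(0, sigma^2)
empirical_bias = relu_z.mean()
theoretical_bias = sigma / np.sqrt(2 * np.pi)

print(f"empirical bias:   {empirical_bias:.4f}")
print(f"theoretical bias: {theoretical_bias:.4f}")
```

The point of the sketch is that the fluctuation itself averages to zero; only after the ReLU does it contribute a net bias, which is the kind of effect a purely linear reading of the representation would miss.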

Key facts

  • Paper on arXiv:2605.05223
  • Sparse Autoencoders (SAEs) used for feature disentanglement in transformers
  • Compositional steering involves simultaneous activation of distinct semantic latents
  • Linear Representation Hypothesis challenged by non-linear interference effects
  • Geometric framework models activation space as sparse cone manifold
  • Asymptotic compositional-collapse threshold derived under spherical dictionary model
  • Threshold characterized by Gaussian mean width of signal cone
  • ReLU rectification converts correlation-induced variance fluctuations into systematic bias
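The collapse threshold above is stated in terms of the Gaussian mean width, w(K) = E[sup over unit vectors x in K of ⟨g, x⟩] with g standard Gaussian. As an illustrative computation under an assumed example cone (the nonnegative orthant, not the paper's signal cone), the supremum has the closed form ‖ReLU(g)‖, which makes a Monte Carlo estimate straightforward:

```python
import numpy as np

# Monte Carlo estimate of the Gaussian mean width of a cone,
#   w(K) = E[ sup_{x in K, ||x||=1} <g, x> ],  g ~ N(0, I_n).
# Example cone (an assumption for illustration): the nonnegative
# orthant, where the supremum over unit vectors equals ||ReLU(g)||.
rng = np.random.default_rng(1)
n, trials = 256, 20_000

g = rng.normal(size=(trials, n))
sup_inner = np.linalg.norm(np.maximum(g, 0.0), axis=1)  # sup over the orthant

mean_width = sup_inner.mean()
print(f"estimated mean width: {mean_width:.2f}  (~ sqrt(n/2) = {np.sqrt(n / 2):.2f})")
```

For this cone the estimate concentrates near sqrt(n/2), about half the squared width of the full sphere's sqrt(n); narrower cones have smaller mean width, which is the sense in which the quantity measures the effective size of the signal set entering the threshold.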

Entities

Institutions

  • arXiv