ARTFEED — Contemporary Art Intelligence

Geometric Framework Reveals Instability in Sparse Autoencoder Feature Steering

ai-technology · 2026-05-09

A recent theoretical study posted to arXiv (2605.05223) examines the structural instability that arises when features are composed in Sparse Autoencoders (SAEs). Although SAEs help disentangle feature superposition in transformers and enable activation steering, the research indicates that activating several semantic latents simultaneously can trigger compositional collapse. The authors model the activation space as a high-dimensional sparse cone manifold and, under a spherical dictionary model, derive an asymptotic collapse threshold characterized by the Gaussian mean width of the signal cone. They further show that ReLU rectification converts correlation-induced variance fluctuations into systematic bias, exacerbating the instability of feature unions. By exposing non-linear interference effects in overcomplete dictionaries, the study challenges the Linear Representation Hypothesis.
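The variance-to-bias mechanism can be illustrated with a toy computation (a hedged sketch, not the paper's derivation): for zero-mean Gaussian fluctuations z ~ N(0, σ²), rectification shifts the mean from 0 to σ/√(2π), so symmetric noise becomes a systematic positive offset.

```python
import numpy as np

# Toy illustration (not the paper's model): zero-mean Gaussian
# fluctuations acquire a positive mean after ReLU rectification.
rng = np.random.default_rng(0)
sigma = 0.5
z = rng.normal(0.0, sigma, size=1_000_000)

relu_z = np.maximum(z, 0.0)

# Closed form: E[ReLU(z)] = sigma / sqrt(2*pi) for z ~ N(0, sigma^2)
empirical_bias = relu_z.mean()
theoretical_bias = sigma / np.sqrt(2 * np.pi)

print(f"empirical bias:   {empirical_bias:.4f}")
print(f"theoretical bias: {theoretical_bias:.4f}")
```

The point of the sketch is that the fluctuation itself averages to zero; only after the ReLU does it contribute a net bias, which is the kind of effect a purely linear reading of the representation would miss.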

Key facts

  • Paper on arXiv:2605.05223
  • Sparse Autoencoders (SAEs) used for feature disentanglement in transformers
  • Compositional steering involves simultaneous activation of distinct semantic latents
  • Linear Representation Hypothesis challenged by non-linear interference effects
  • Geometric framework models activation space as sparse cone manifold
  • Asymptotic compositional-collapse threshold derived under spherical dictionary model
  • Threshold characterized by Gaussian mean width of signal cone
  • ReLU rectification converts correlation-induced variance fluctuations into systematic bias
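The collapse threshold above is stated in terms of the Gaussian mean width, w(K) = E[sup over unit vectors x in K of ⟨g, x⟩] with g standard Gaussian. As an illustrative computation under an assumed example cone (the nonnegative orthant, not the paper's signal cone), the supremum has the closed form ‖ReLU(g)‖, which makes a Monte Carlo estimate straightforward:

```python
import numpy as np

# Monte Carlo estimate of the Gaussian mean width of a cone,
#   w(K) = E[ sup_{x in K, ||x||=1} <g, x> ],  g ~ N(0, I_n).
# Example cone (an assumption for illustration): the nonnegative
# orthant, where the supremum over unit vectors equals ||ReLU(g)||.
rng = np.random.default_rng(1)
n, trials = 256, 20_000

g = rng.normal(size=(trials, n))
sup_inner = np.linalg.norm(np.maximum(g, 0.0), axis=1)  # sup over the orthant

mean_width = sup_inner.mean()
print(f"estimated mean width: {mean_width:.2f}  (~ sqrt(n/2) = {np.sqrt(n / 2):.2f})")
```

For this cone the estimate concentrates near sqrt(n/2), about half the squared width of the full sphere's sqrt(n); narrower cones have smaller mean width, which is the sense in which the quantity measures the effective size of the signal set entering the threshold.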

Entities

Institutions

  • arXiv