Sparse Autoencoders Perform Close to LoRA on LLM Steering Benchmark
A recent study published on arXiv (2605.31183) challenges earlier conclusions that Sparse Autoencoders (SAEs) underperform simple baselines at steering Large Language Models (LLMs). The researchers show that, with a supervised feature selection and labeling pipeline, SAEs achieve results comparable to LoRA on the AxBench benchmark, partially rebutting the findings of Wu et al. (2025). The pipeline relies solely on interpretability-based components, yet it identifies features that are surprisingly causal for their labels. The research also indicates that high sparsity (low L0, meaning few features active per input) may not be essential for effective steering.
Key facts
- Paper arXiv:2605.31183v1 is a partial rebuttal to Wu et al. (2025).
- With a supervised pipeline, SAEs perform close to LoRA on AxBench.
- The pipeline selects causal features using only interpretability-based components.
- High sparsity (low L0) may not be crucial for SAE steering performance.
- AxBench is a model steering benchmark introduced in Wu et al. (2025).
- SAEs are used for exploring LLM internals and steering output generation.
- Wu et al. (2025) reported poor SAE steering relative to simple baselines.
- The paper suggests that earlier results did not do SAEs full justice.
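To make the terms in the facts above concrete, here is a toy sketch of what SAE feature activations, L0, and feature-based steering typically look like. It is not the paper's pipeline: the weights, the function names (sae_encode, l0, steer), and the steering strength alpha are all hypothetical, and it assumes the common setup in which steering adds a scaled decoder direction to a hidden state.

```python
# Toy illustration only: hand-picked weights, not a trained SAE.

def sae_encode(h, W_enc, b_enc):
    """Feature activations f = ReLU(W_enc @ h + b_enc), written with plain lists."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, h)) + b)
        for row, b in zip(W_enc, b_enc)
    ]

def l0(features, eps=1e-9):
    """L0 of an activation vector: the number of features that are active (nonzero)."""
    return sum(1 for f in features if abs(f) > eps)

def steer(h, decoder_direction, alpha):
    """Assumed steering rule: add alpha times one feature's decoder direction to h."""
    return [x + alpha * d for x, d in zip(h, decoder_direction)]

# Hypothetical 3-dimensional hidden state and a 4-feature SAE.
h = [1.0, -2.0, 0.5]
W_enc = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
b_enc = [0, 0, 0, 0]
W_dec = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]  # one decoder row per feature

features = sae_encode(h, W_enc, b_enc)
print("active features (L0):", l0(features))

# Steer toward feature 1, which is inactive for this input.
steered = steer(h, W_dec[1], alpha=4.0)
print("steered hidden state:", steered)
print("feature 1 after steering:", sae_encode(steered, W_enc, b_enc)[1])
```

In a real model the hidden state would come from a transformer layer and the steered vector would be written back before the forward pass continues. Which features to steer with is the part the supervised selection step addresses, and the source does not describe it in enough detail to sketch here.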
Entities
Institutions
- arXiv