New Method Identifies LLM Refusal Mechanisms via Sparse Autoencoders
Researchers have developed a novel pipeline that uses sparse autoencoders (SAEs) to dissect refusal behavior in instruction-tuned large language models (LLMs). The study, published on arXiv (2509.09708), examines two public models, Gemma-2-2B-IT and LLaMA-3.1-8B-IT. By training SAEs on residual-stream activations, the team searches for sets of SAE features whose ablation switches the model from refusal to compliance, effectively creating a jailbreak. The pipeline consists of three stages: (1) find a refusal-mediating direction and collect nearby SAE features; (2) greedily filter the candidates down to a minimal set; and (3) discover interactions, using a factorization machine to capture nonlinear dependencies among the active features. The approach yields a broad set of jailbreak-critical features, offering insight into the internal causes of refusal and aiming to improve understanding of safety mechanisms in LLMs.
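The paper's own code is not reproduced here; the sketch below is only an illustration, under common assumptions, of what stages 1 and 2 could look like: a difference-in-means refusal direction, candidate SAE features ranked by how closely their decoder vectors align with that direction, and a greedy pass that prunes features not needed to flip refusal into compliance. The function names, the cosine-ranking heuristic, and the `complies_when_ablated` callback are hypothetical choices for this sketch, not the authors' implementation.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Stage 1a (sketch): difference-in-means direction between residual-stream
    activations on refusal-inducing vs. benign prompts."""
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def candidate_sae_features(W_dec, direction, top_k=50):
    """Stage 1b (sketch): collect SAE features whose decoder vectors point in a
    similar direction, ranked by absolute cosine similarity."""
    W = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
    scores = W @ direction
    return list(np.argsort(-np.abs(scores))[:top_k])

def greedy_minimal_set(candidates, complies_when_ablated):
    """Stage 2 (sketch): greedily drop candidates that are not needed for the
    ablation to flip the model from refusal to compliance.
    `complies_when_ablated(subset)` is a caller-supplied check, e.g. run the
    model with those SAE features zeroed out and classify the completion."""
    kept = list(candidates)
    for f in list(kept):
        trial = [g for g in kept if g != f]
        if trial and complies_when_ablated(trial):
            kept = trial
    return kept

# Toy usage with random stand-ins for real activations and SAE weights.
rng = np.random.default_rng(0)
d_model, n_features = 64, 512
harmful = rng.normal(size=(100, d_model)) + 0.3
harmless = rng.normal(size=(100, d_model))
W_dec = rng.normal(size=(n_features, d_model))  # SAE decoder matrix

direction = refusal_direction(harmful, harmless)
candidates = candidate_sae_features(W_dec, direction, top_k=10)
# Pretend only the top three candidates actually matter for the jailbreak.
needed = set(candidates[:3])
minimal = greedy_minimal_set(candidates, lambda s: needed.issubset(s))
print(minimal)
```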
Key facts
- Study uses sparse autoencoders (SAEs) trained on residual-stream activations.
- Models analyzed: Gemma-2-2B-IT and LLaMA-3.1-8B-IT.
- Three-stage pipeline: Refusal Direction, Greedy Filtering, Interaction Discovery.
- Ablation of identified feature sets flips model from refusal to compliance.
- Factorization machine captures nonlinear interactions among features (see the sketch after this list).
- Pipeline yields a broad set of jailbreak-critical features.
- Published on arXiv with ID 2509.09708.
- Focuses on instruction-tuned LLMs.
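The interaction-discovery stage is described only at a high level above. As a hedged illustration, the sketch below shows the standard second-order factorization machine form, ŷ(x) = w₀ + w·x + Σ_{i<j} ⟨v_i, v_j⟩ x_i x_j, which is one way such a model can score pairwise (nonlinear) interactions among active SAE features; the data layout and helper names are assumptions for this example, not the paper's formulation.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order factorization machine: w0 + w.x + sum_{i<j} <v_i, v_j> x_i x_j.

    x: feature activations, shape (n_features,)
    w0: scalar bias; w: linear weights, shape (n_features,)
    V: factor matrix, shape (n_features, k); row i is the latent vector v_i."""
    linear = w0 + w @ x
    # O(n*k) identity for the pairwise term.
    pairwise = 0.5 * np.sum((V.T @ x) ** 2 - (V ** 2).T @ (x ** 2))
    return linear + pairwise

def pairwise_interaction_strength(V, i, j):
    """Interaction weight the FM assigns to the feature pair (i, j)."""
    return V[i] @ V[j]

# Toy usage: score a vector of SAE feature activations and inspect one pair.
rng = np.random.default_rng(1)
n_features, k = 32, 4
x = rng.random(n_features)           # stand-in for SAE feature activations on a prompt
w0, w = 0.1, rng.normal(size=n_features)
V = rng.normal(size=(n_features, k))

print(fm_predict(x, w0, w, V))
print(pairwise_interaction_strength(V, 3, 7))
```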
Entities
Institutions
- arXiv