New Method Identifies LLM Refusal Mechanisms via Sparse Autoencoders
Researchers have developed a novel pipeline that uses sparse autoencoders (SAEs) to dissect refusal behavior in instruction-tuned large language models (LLMs). The study, published on arXiv (2509.09708), examines two public models, Gemma-2-2B-IT and LLaMA-3.1-8B-IT. By training SAEs on residual-stream activations, the team searches for sets of SAE features whose ablation switches the model from refusal to compliance, effectively creating a jailbreak. The pipeline consists of three stages: (1) find a refusal-mediating direction and collect nearby SAE features; (2) greedily filter the candidates down to a minimal set; and (3) discover interactions, using a factorization machine to capture nonlinear dependencies among the active features. The approach yields a broad set of jailbreak-critical features, offering insight into the internal causes of refusal and aiming to improve understanding of safety mechanisms in LLMs.
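The paper's own code is not reproduced here; the sketch below is only an illustration, under common assumptions, of what stages 1 and 2 could look like: a difference-in-means refusal direction, candidate SAE features ranked by how closely their decoder vectors align with that direction, and a greedy pass that prunes features not needed to flip refusal into compliance. The function names, the cosine-ranking heuristic, and the `complies_when_ablated` callback are hypothetical choices for this sketch, not the authors' implementation.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Stage 1a (sketch): difference-in-means direction between residual-stream
    activations on refusal-inducing vs. benign prompts."""
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def candidate_sae_features(W_dec, direction, top_k=50):
    """Stage 1b (sketch): collect SAE features whose decoder vectors point in a
    similar direction, ranked by absolute cosine similarity."""
    W = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
    scores = W @ direction
    return list(np.argsort(-np.abs(scores))[:top_k])

def greedy_minimal_set(candidates, complies_when_ablated):
    """Stage 2 (sketch): greedily drop candidates that are not needed for the
    ablation to flip the model from refusal to compliance.
    `complies_when_ablated(subset)` is a caller-supplied check, e.g. run the
    model with those SAE features zeroed out and classify the completion."""
    kept = list(candidates)
    for f in list(kept):
        trial = [g for g in kept if g != f]
        if trial and complies_when_ablated(trial):
            kept = trial
    return kept

# Toy usage with random stand-ins for real activations and SAE weights.
rng = np.random.default_rng(0)
d_model, n_features = 64, 512
harmful = rng.normal(size=(100, d_model)) + 0.3
harmless = rng.normal(size=(100, d_model))
W_dec = rng.normal(size=(n_features, d_model))  # SAE decoder matrix

direction = refusal_direction(harmful, harmless)
candidates = candidate_sae_features(W_dec, direction, top_k=10)
# Pretend only the top three candidates actually matter for the jailbreak.
needed = set(candidates[:3])
minimal = greedy_minimal_set(candidates, lambda s: needed.issubset(s))
print(minimal)
```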
Key facts
- Study uses sparse autoencoders (SAEs) trained on residual-stream activations.
- Models analyzed: Gemma-2-2B-IT and LLaMA-3.1-8B-IT.
- Three-stage pipeline: Refusal Direction, Greedy Filtering, Interaction Discovery.
- Ablation of identified feature sets flips model from refusal to compliance.
- Factorization machine captures nonlinear interactions among features (see the sketch after this list).
- Pipeline yields a broad set of jailbreak-critical features.
- Published on arXiv with ID 2509.09708.
- Focuses on instruction-tuned LLMs.
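The interaction-discovery stage is described only at a high level above. As a hedged illustration, the sketch below shows the standard second-order factorization machine form, ŷ(x) = w₀ + w·x + Σ_{i<j} ⟨v_i, v_j⟩ x_i x_j, which is one way such a model can score pairwise (nonlinear) interactions among active SAE features; the data layout and helper names are assumptions for this example, not the paper's formulation.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order factorization machine: w0 + w.x + sum_{i<j} <v_i, v_j> x_i x_j.

    x: feature activations, shape (n_features,)
    w0: scalar bias; w: linear weights, shape (n_features,)
    V: factor matrix, shape (n_features, k); row i is the latent vector v_i."""
    linear = w0 + w @ x
    # O(n*k) identity for the pairwise term.
    pairwise = 0.5 * np.sum((V.T @ x) ** 2 - (V ** 2).T @ (x ** 2))
    return linear + pairwise

def pairwise_interaction_strength(V, i, j):
    """Interaction weight the FM assigns to the feature pair (i, j)."""
    return V[i] @ V[j]

# Toy usage: score a vector of SAE feature activations and inspect one pair.
rng = np.random.default_rng(1)
n_features, k = 32, 4
x = rng.random(n_features)           # stand-in for SAE feature activations on a prompt
w0, w = 0.1, rng.normal(size=n_features)
V = rng.normal(size=(n_features, k))

print(fm_predict(x, w0, w, V))
print(pairwise_interaction_strength(V, 3, 7))
```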
Entities
Institutions
- arXiv