ARTFEED — Contemporary Art Intelligence

New Method Identifies LLM Refusal Mechanisms via Sparse Autoencoders

ai-technology · 2026-04-30

Researchers have developed a pipeline that uses sparse autoencoders (SAEs) to dissect refusal behavior in instruction-tuned large language models (LLMs). The study, published on arXiv (2509.09708), examines two public models, Gemma-2-2B-IT and LLaMA-3.1-8B-IT. Training SAEs on residual-stream activations, the team searches for feature sets whose ablation flips the model from refusal to compliance, in effect constructing a jailbreak. The pipeline has three stages: (1) find a refusal-mediating direction and collect the SAE features near it; (2) greedily filter those candidates down to a minimal set; and (3) discover interactions using a factorization machine that captures nonlinear dependencies among active features. The result is a broad set of jailbreak-critical features, offering insight into the internal causes of refusal and into the safety mechanisms of LLMs.
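A minimal sketch of stage 1, under assumptions not spelled out in the article: the refusal-mediating direction is taken as a difference of means between harmful- and harmless-prompt activations (a common recipe for "refusal directions"), and "nearby" SAE features are ranked by cosine similarity between the direction and the SAE decoder rows. All shapes and data here are toy stand-ins, not the paper's models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for residual-stream activations (hypothetical shapes):
# 64 harmful-prompt and 64 harmless-prompt activations, hidden size 128.
d_model, n = 128, 64
acts_harmful = rng.normal(0.5, 1.0, (n, d_model))
acts_harmless = rng.normal(-0.5, 1.0, (n, d_model))

# Stage 1a: refusal-mediating direction as a normalized difference of
# means (an assumed recipe; the paper's exact method may differ).
refusal_dir = acts_harmful.mean(axis=0) - acts_harmless.mean(axis=0)
refusal_dir /= np.linalg.norm(refusal_dir)

# Stage 1b: collect SAE features "near" that direction. The SAE decoder
# is modeled as an (n_features, d_model) matrix whose rows are feature
# directions; features are ranked by |cosine similarity|.
n_features = 512
W_dec = rng.normal(size=(n_features, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)

cos_sim = W_dec @ refusal_dir
candidates = np.argsort(-np.abs(cos_sim))[:32]  # top-32 candidate features
print(f"top candidate feature ids: {candidates[:5]}")
```

The candidate set is deliberately over-complete here; stage 2 of the pipeline is what prunes it down.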

Key facts

  • Study uses sparse autoencoders (SAEs) trained on residual-stream activations.
  • Models analyzed: Gemma-2-2B-IT and LLaMA-3.1-8B-IT.
  • Three-stage pipeline: Refusal Direction, Greedy Filtering, Interaction Discovery.
  • Ablation of identified feature sets flips model from refusal to compliance.
  • Factorization machine captures nonlinear interactions among features.
  • Pipeline yields a broad set of jailbreak-critical features.
  • Published on arXiv with ID 2509.09708.
  • Focuses on instruction-tuned LLMs.
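Stages 2 and 3 above can be sketched as follows, again under stated assumptions: the expensive "ablate these features and see whether refusal flips" check is mocked by a black-box oracle in which only the hypothetical features {3, 7, 11} matter, and the factorization machine is the standard second-order FM, y(x) = w0 + Σᵢ wᵢxᵢ + Σᵢ<ⱼ ⟨vᵢ, vⱼ⟩xᵢxⱼ, with random stand-in parameters rather than anything fitted to model behavior.

```python
import itertools
import numpy as np

# --- Stage 2: greedy filtering (minimal sketch). `still_jailbreaks`
# stands in for "ablate these SAE features and check whether the model
# flips from refusal to compliance"; here it is mocked so that only the
# hypothetical features {3, 7, 11} are actually needed.
CRITICAL = {3, 7, 11}

def still_jailbreaks(features):
    return CRITICAL <= set(features)

def greedy_filter(candidates):
    """Try dropping each feature in turn; keep it only if the jailbreak
    fails without it. Returns an approximately minimal feature set."""
    kept = list(candidates)
    for f in list(kept):
        trial = [g for g in kept if g != f]
        if still_jailbreaks(trial):
            kept = trial  # f was redundant, drop it
    return kept

minimal = greedy_filter(range(16))

# --- Stage 3: interaction discovery with a factorization machine.
# The pairwise terms <v_i, v_j> x_i x_j expose nonlinear interactions
# between co-active features; parameters below are random stand-ins.
rng = np.random.default_rng(0)
n_feat, k = 16, 4
w0, w = 0.1, rng.normal(size=n_feat)
V = rng.normal(size=(n_feat, k))  # one k-dim embedding per feature

def fm_score(x):
    pairwise = sum(
        (V[i] @ V[j]) * x[i] * x[j]
        for i, j in itertools.combinations(range(n_feat), 2)
    )
    return w0 + w @ x + pairwise

x = np.zeros(n_feat)
x[minimal] = 1.0  # mark the surviving features as active
print(f"minimal set: {minimal}, FM score: {fm_score(x):.3f}")
```

The greedy pass is order-dependent and only approximately minimal, which is why the pipeline follows it with the FM step: pairwise embeddings can flag feature combinations whose joint effect a one-at-a-time ablation would miss.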

Entities

Institutions

  • arXiv

Sources