Graph-Regularized Sparse Autoencoders Improve LLM Safety Steering

ai-technology · 2026-05-18

Researchers have introduced Graph-Regularized Sparse Autoencoders (GSAE), a dictionary-learning technique aimed at improving safety steering in large language models (LLMs). Standard sparse autoencoders (SAEs) treat latent features as independent, which is a poor match for safety behaviors such as refusal and harmful compliance. GSAE instead smooths the SAE decoder vectors over a neuron co-activation graph, and a two-gate runtime controller then applies the resulting direction bank. Reported results show improved selective refusal on three benchmarks (JailbreakBench, HarmBench, and XSTest), with more harmful requests refused while benign refusals stay low. When GSAE replaced the standard SAE in the Llama-3-8B pipeline, Δs rose by 20.1 points on JailbreakBench. The results are detailed in arXiv:2512.06655v3.

Key facts

GSAE is a new dictionary-learning method for LLM safety steering.
Standard SAEs treat latent features as independent, mismatching safety behaviors.
GSAE smooths SAE decoder vectors over a neuron co-activation graph.
A two-gate runtime controller applies the direction bank.
GSAE improves selective refusal on JailbreakBench, HarmBench, and XSTest.
On Llama-3-8B, GSAE improves Δs by 20.1 points on JailbreakBench.
The paper is arXiv:2512.06655v3.
The method increases harmful-request refusal while keeping benign refusals low.
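
To make "smoothing decoder vectors over a neuron co-activation graph" concrete, here is a minimal NumPy sketch of one common way to express such a regularizer: a graph-Laplacian smoothness penalty. It is an illustration under stated assumptions, not the paper's implementation. The function names, the thresholded co-activation count used to build the graph, and the trace-form penalty are all hypothetical choices made for this example.

```python
import numpy as np


def coactivation_graph(acts, threshold=0.0):
    """Build a symmetric neuron co-activation adjacency matrix.

    acts: array of shape (n_samples, n_neurons) with neuron activations.
    Assumption: two neurons are linked in proportion to how often both
    exceed `threshold` on the same sample.
    """
    active = (acts > threshold).astype(float)
    counts = active.T @ active          # (n_neurons, n_neurons) co-activation counts
    np.fill_diagonal(counts, 0.0)       # no self-loops
    max_count = counts.max()
    return counts / max_count if max_count > 0 else counts


def graph_laplacian(adj):
    """Unnormalized graph Laplacian L = D - A."""
    return np.diag(adj.sum(axis=1)) - adj


def smoothness_penalty(decoder, lap):
    """Laplacian smoothness of decoder vectors over the neuron graph.

    decoder: array of shape (n_neurons, n_latents); each column is one
    decoder vector, viewed as a signal over neurons.
    Returns trace(W^T L W), which for a symmetric adjacency A equals
    0.5 * sum_ij A_ij * ||W[i] - W[j]||^2.
    """
    return float(np.trace(decoder.T @ lap @ decoder))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acts = rng.normal(size=(200, 6))     # toy activations, not real model data
    decoder = rng.normal(size=(6, 4))    # toy decoder with 4 latent directions
    lap = graph_laplacian(coactivation_graph(acts))
    print("smoothness penalty:", smoothness_penalty(decoder, lap))
```

In a training loop, a term like this would typically be added to the usual SAE reconstruction and sparsity losses with a weighting coefficient. A low penalty means that neurons which often fire together receive similar decoder weights, which is the sense in which related features stop being treated as independent.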
