Sparse Autoencoders Reduce Jailbreak Vulnerability in Large Language Models
Splicing pretrained Sparse Autoencoders (SAEs) into the residual streams of transformers at inference time substantially improves the resilience of Large Language Models (LLMs) to jailbreak attacks. The approach, applied to Gemma, LLaMA, Mistral, and Qwen models, leaves the original model weights unchanged and was evaluated against two strong white-box attacks, GCG and BEAST, as well as three black-box benchmarks. SAE-equipped models showed up to a fivefold reduction in jailbreak success rates and reduced attack transferability across models. A clear monotonic dose-response relationship emerged: the sparser the SAE (the lower its L0), the lower the attack success rate. The study, available as arXiv preprint 2604.18756v1, shows that SAEs can harden LLMs against jailbreaks without retraining.
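As a minimal sketch of the idea rather than the paper's actual implementation (the SAE architecture, the `splice_sae` helper, the layer index, and the `load_pretrained_sae` loader below are all assumptions for illustration), a pretrained SAE can be spliced into the residual stream with a standard PyTorch forward hook, replacing a layer's output with its SAE reconstruction while leaving every model weight untouched:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """A generic ReLU SAE: encode into a wide sparse latent, decode back.
    (Assumed architecture; the paper may use a different SAE variant.)"""
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)
        self.dec = nn.Linear(d_latent, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = torch.relu(self.enc(x))   # ReLU keeps most latents at exactly zero
        return self.dec(z)            # reconstruction back in model space

def splice_sae(block: nn.Module, sae: SparseAutoencoder):
    """Register a forward hook that swaps the block's residual-stream output
    for the SAE's reconstruction. No weights are modified, and gradients
    still flow through the SAE, so white-box attackers keep gradient access."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        recon = sae(hidden)
        return (recon, *output[1:]) if isinstance(output, tuple) else recon
    return block.register_forward_hook(hook)

# Hypothetical usage: splice a pretrained SAE at an intermediate layer,
# which the paper reports best balances robustness and clean performance.
# model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")
# sae = load_pretrained_sae(...)     # assumed loader, not a real API
# handle = splice_sae(model.model.layers[12], sae)
# handle.remove()                    # detach the hook to restore the model
```

Because the defense is a hook rather than a weight edit, it can be attached or removed per request, and the underlying checkpoint never needs retraining.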
Key facts
- Sparse Autoencoders (SAEs) integrated at inference reduce jailbreak success rates by up to 5x.
- The method was tested on Gemma, LLaMA, Mistral, and Qwen model families.
- Defenses were evaluated against GCG and BEAST white-box attacks and three black-box benchmarks.
- The approach modifies no model weights and does not block gradients, so white-box attackers retain full gradient access.
- Greater SAE sparsity (lower L0, i.e., fewer active latents per token) monotonically reduces attack success; see the sketch after this list.
- Intermediate layers offer the best balance between defense robustness and clean performance.
- The technique also reduces cross-model attack transferability.
- The study is documented in arXiv preprint 2604.18756v1.
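To make the dose-response fact concrete: L0 here is the average number of nonzero SAE latents per token, the sparsity statistic the paper sweeps. A rough way to measure it, reusing the assumed `SparseAutoencoder` interface from the sketch above:

```python
import torch

@torch.no_grad()
def mean_l0(sae: "SparseAutoencoder", hidden: torch.Tensor) -> float:
    """Average count of active (nonzero) latents per token. Per the summary
    above, sparser SAEs (lower L0) correspond to lower attack success."""
    z = torch.relu(sae.enc(hidden))            # latent activations
    return (z > 0).float().sum(dim=-1).mean().item()
```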