Diff-SAE Outperforms Crosscoders in Detecting Backdoor Attacks on LLMs

ai-technology · 2026-05-11

A recent research paper from arXiv (2605.07324) investigates two sparse autoencoder models—Crosscoders and Differential SAEs (Diff-SAE)—focused on identifying features linked to backdoors in fine-tuned language models. The study utilized a controlled SQL injection backdoor, where context based on the year ("2024" activates vulnerable code, while "2023" activates secure code) was employed. The researchers assessed both models under LoRA and full-rank fine-tuning conditions using SmolLM2-360M. Diff-SAE showed a significant advantage over Crosscoders, achieving a Backdoor Isolation Score (BIS) of 0.40, perfect precision (1.0), and no false positives in most tests, whereas Crosscoders performed poorly. This research tackles the challenge of backdoor detection through mechanistic interpretability, showcasing Diff-SAE's promise for enhancing AI safety.

Key facts

Study compares Crosscoders and Differential SAEs (Diff-SAE) for backdoor detection.
Backdoor attack uses SQL injection triggered by year-based context: '2024' triggers vulnerable code, '2023' triggers safe code.
Evaluated on SmolLM2-360M across LoRA and full-rank fine-tuning regimes.
Diff-SAE achieves Backdoor Isolation Score (BIS) of 0.40 with perfect precision (1.0) and zero false positive rate.
Crosscoders fail almost entirely across most experimental conditions.
Research aims to improve detection of backdoors through mechanistic interpretability.
Published on arXiv with identifier 2605.07324.
Backdoor attacks pose a significant threat to AI safety.

Diff-SAE Outperforms Crosscoders in Detecting Backdoor Attacks on LLMs

Key facts

Entities

Institutions

Sources