SAEgis: Sparse Autoencoder Firewall for VLM Adversarial Attacks

ai-technology · 2026-05-11

Researchers have developed SAEgis, a lightweight system that uses sparse autoencoders (SAEs) to detect adversarial attacks on vision-language models (VLMs). An SAE module is inserted into a pretrained VLM and trained with a standard reconstruction objective; the sparse features it learns capture attack-relevant signals. In the reported experiments, SAEgis detects adversarially perturbed images, including samples it has not seen before. The study also finds that both proprietary and open-weight VLMs remain highly vulnerable to such attacks, underscoring the need for safety measures in practical deployments. The paper is available on arXiv.
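To make "trained with a standard reconstruction objective" concrete, here is a minimal sketch of a sparse autoencoder applied to a single hidden-activation vector. It is an illustration only, not the paper's implementation: the ReLU encoder, the linear decoder, the mean-squared-error term plus L1 sparsity penalty, and every name and hyperparameter (SparseAutoencoder, sae_loss, l1_coeff, the layer sizes) are assumptions, since the summary does not specify them.

```python
import random


def relu(x):
    return x if x > 0.0 else 0.0


class SparseAutoencoder:
    """Toy SAE (hypothetical): z = ReLU(W_enc h + b_enc), h_hat = W_dec z + b_dec."""

    def __init__(self, d_model, d_sparse, seed=0):
        rng = random.Random(seed)
        # Small random weights; a real SAE would learn these by gradient descent.
        self.w_enc = [[rng.gauss(0.0, 0.1) for _ in range(d_model)]
                      for _ in range(d_sparse)]
        self.b_enc = [0.0] * d_sparse
        self.w_dec = [[rng.gauss(0.0, 0.1) for _ in range(d_sparse)]
                      for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, h):
        # Map a VLM activation vector h to non-negative sparse features z.
        return [relu(sum(w * x for w, x in zip(row, h)) + b)
                for row, b in zip(self.w_enc, self.b_enc)]

    def decode(self, z):
        # Reconstruct the activation vector from the sparse features.
        return [sum(w * a for w, a in zip(row, z)) + b
                for row, b in zip(self.w_dec, self.b_dec)]


def sae_loss(sae, h, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty on the features (assumed objective)."""
    z = sae.encode(h)
    h_hat = sae.decode(z)
    mse = sum((a - b) ** 2 for a, b in zip(h, h_hat)) / len(h)
    l1 = sum(abs(a) for a in z)
    return mse + l1_coeff * l1, z


# Usage: h stands in for an activation taken from the pretrained VLM.
sae = SparseAutoencoder(d_model=4, d_sparse=8)
loss, z = sae_loss(sae, [0.5, -1.0, 0.25, 2.0])
```

In this sketch the VLM itself is left untouched; only the autoencoder's weights would be trained, which is one reading of the "plug-and-play" description.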

Key facts

SAEgis is a novel adversarial attack detection framework for VLMs.
It uses sparse autoencoders as plug-and-play modules.
The SAE is trained with standard reconstruction objectives.
Learned sparse features capture attack-relevant signals.
SAEgis detects adversarial perturbations even on unseen samples.
Experiments show strong in-domain and cross-domain performance.
Proprietary and open-weight VLMs remain highly vulnerable to attacks.
The paper is published on arXiv with ID 2605.07447.
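
The summary does not say how the learned sparse features are turned into a detection decision. One simple possibility, sketched below purely as an assumption, is to sum the activations of a few features believed to respond to attacks and flag inputs whose score exceeds a threshold calibrated on clean samples. The function names, the feature indices, and the 95% quantile are all hypothetical.

```python
def attack_score(z, feature_ids):
    # Sum the activations of the (hypothetical) attack-relevant features.
    return sum(z[i] for i in feature_ids)


def calibrate_threshold(clean_scores, quantile=0.95):
    # Pick the score at the given quantile of clean-sample scores.
    ordered = sorted(clean_scores)
    idx = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[idx]


def is_adversarial(z, feature_ids, threshold):
    # Flag the input when its score exceeds the clean-data threshold.
    return attack_score(z, feature_ids) > threshold


# Usage with made-up sparse feature vectors (illustrative values only).
feature_ids = [1]
clean_features = [[0.0, 0.1, 0.0], [0.0, 0.2, 0.0], [0.1, 0.0, 0.0]]
threshold = calibrate_threshold(
    [attack_score(z, feature_ids) for z in clean_features]
)
flagged = is_adversarial([0.0, 0.9, 0.0], feature_ids, threshold)
```

A threshold rule like this keeps the detector cheap to run alongside the VLM, but any practical system would need to choose features and thresholds from real attack data rather than by hand.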
