Sparse Backdoor: Provably Undetectable Supply-Chain Attack on Image Classifiers
Researchers have developed Sparse Backdoor, a supply-chain attack that embeds an undetectable backdoor in pre-trained image classifiers such as convolutional networks and Vision Transformers. The attack injects a structured sparse perturbation into a small number of randomly chosen columns of each fully connected layer, propagating a trigger signal to an adversary-chosen target class. An independent isotropic Gaussian dither, applied to the pre-trained weights, defines a clean reference distribution that masks the perturbation and enables formal undetectability. The authors show that distinguishing the backdoored model from this reference is at least as hard as the sparse PCA detection problem, which is conjectured to be computationally intractable. The attack highlights new supply-chain security risks for deep learning and raises concerns about AI safety and reliability.
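The core mechanics described above can be sketched in numpy. This is a minimal illustration, not the paper's construction: the function name, the choice of Gaussian values on the sparse support, and all parameter values are assumptions for demonstration purposes.

```python
import numpy as np

def plant_sparse_backdoor(W, k, epsilon, sigma, rng):
    """Hypothetical sketch: perturb k randomly chosen columns of a
    fully connected weight matrix W (the structured sparse perturbation),
    then add an independent isotropic Gaussian dither to every entry
    so the perturbation is masked by the dithered reference distribution."""
    d_out, d_in = W.shape
    cols = rng.choice(d_in, size=k, replace=False)        # random sparse support
    perturb = np.zeros_like(W)
    perturb[:, cols] = epsilon * rng.standard_normal((d_out, k))
    dither = sigma * rng.standard_normal(W.shape)          # isotropic Gaussian dither
    return W + perturb + dither, cols

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))                         # pre-trained layer weights
W_bd, support = plant_sparse_backdoor(W, k=4, epsilon=0.05, sigma=0.05, rng=rng)
```

With small `epsilon` and `sigma`, the perturbed layer stays close to the original weights entry-wise, which is consistent with the paper's claim that the dithered reference remains functionally equivalent under a margin condition.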
Key facts
- Sparse Backdoor is a supply-chain attack on pre-trained image classifiers.
- It targets convolutional networks and Vision Transformers.
- The attack injects a structured sparse perturbation into fully connected layers.
- A Gaussian dither masks the perturbation and enables formal undetectability.
- The dithered reference is functionally equivalent to the original classifier under a margin condition.
- Detecting the backdoor is at least as hard as Sparse PCA detection.
- The attack propagates a trigger signal to an adversary-chosen target class.
- The paper is published on arXiv with ID 2605.04209.
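To make the hardness claim concrete, the sparse PCA detection problem asks a tester to distinguish samples from an isotropic Gaussian (null) from samples whose covariance has a weak planted spike along a sparse direction. The following sketch sets up both distributions; the dimensions, sparsity, and signal strength are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Sparse PCA detection: N(0, I) vs N(0, I + theta * v v^T) with k-sparse unit v.
d, n, k, theta = 50, 200, 5, 2.0
v = np.zeros(d)
v[rng.choice(d, size=k, replace=False)] = 1.0 / np.sqrt(k)  # planted sparse direction

null_samples = rng.multivariate_normal(np.zeros(d), np.eye(d), size=n)
planted_cov = np.eye(d) + theta * np.outer(v, v)
planted_samples = rng.multivariate_normal(np.zeros(d), planted_cov, size=n)

# A naive spectral test statistic: top eigenvalue of the sample covariance.
# In the conjectured-hard regime (theta small, k on the order of sqrt(d)),
# no polynomial-time test is known to beat random guessing.
lam_null = np.linalg.eigvalsh(np.cov(null_samples.T)).max()
lam_planted = np.linalg.eigvalsh(np.cov(planted_samples.T)).max()
```

The reduction in the paper implies that any efficient backdoor detector for Sparse Backdoor would also solve instances of this detection problem, which is the source of the formal undetectability guarantee.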
Entities
Institutions
- arXiv