New Framework Attacks Vision-Language Models via Dual Modalities
Researchers have introduced Multi-Modal Adversarial Synergy (MMAS), a black-box framework that generates adversarial perturbations for images and text simultaneously in order to attack Large Vision-Language Models (LVLMs). The image perturbation is shaped by wavelet-based texture constraints and the text perturbation is a learnable prompt; the two are optimized jointly using only model queries. Existing attacks typically focus on a single modality or require white-box access. By targeting vulnerabilities in multi-modal understanding, the approach poses risks to applications such as autonomous driving and content moderation. The paper is available on arXiv under ID 2605.26501.
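This summary does not say how the wavelet-based texture constraint is defined. As a rough illustration only, the sketch below assumes one plausible reading: the image perturbation is confined to the high-frequency detail bands of a one-level Haar wavelet transform, so it changes fine texture but not the coarse brightness of each 2x2 block. The function names (`haar2d`, `ihaar2d`, `project_to_texture`) and the `eps` budget are hypothetical and do not come from the paper.

```python
# Hypothetical sketch of a wavelet-domain texture constraint (not the paper's method).

def haar2d(grid):
    """One-level 2D Haar transform of a grid with even height and width.

    Returns (approx, d1, d2, d3): a coarse average band and three detail bands.
    """
    approx, d1, d2, d3 = [], [], [], []
    for i in range(0, len(grid), 2):
        ra, r1, r2, r3 = [], [], [], []
        for j in range(0, len(grid[0]), 2):
            a, b = grid[i][j], grid[i][j + 1]
            c, d = grid[i + 1][j], grid[i + 1][j + 1]
            ra.append((a + b + c + d) / 2)  # block average (low frequency)
            r1.append((a - b + c - d) / 2)  # left-right differences
            r2.append((a + b - c - d) / 2)  # top-bottom differences
            r3.append((a - b - c + d) / 2)  # diagonal differences
        approx.append(ra)
        d1.append(r1)
        d2.append(r2)
        d3.append(r3)
    return approx, d1, d2, d3


def ihaar2d(approx, d1, d2, d3):
    """Inverse of haar2d: rebuild the full-resolution grid from the four bands."""
    out = []
    for r in range(len(approx)):
        top, bottom = [], []
        for c in range(len(approx[0])):
            s, u, v, t = approx[r][c], d1[r][c], d2[r][c], d3[r][c]
            top += [(s + u + v + t) / 2, (s - u + v - t) / 2]
            bottom += [(s + u - v - t) / 2, (s - u - v + t) / 2]
        out += [top, bottom]
    return out


def project_to_texture(delta, eps):
    """Keep only detail-band content of a perturbation, clipped to [-eps, eps].

    Assumption: "texture" means the Haar detail bands. With three clipped
    detail coefficients per block, each output pixel is at most 1.5 * eps.
    """
    approx, d1, d2, d3 = haar2d(delta)
    zero = [[0.0] * len(row) for row in approx]  # drop low-frequency content

    def clip(band):
        return [[max(-eps, min(eps, v)) for v in row] for row in band]

    return ihaar2d(zero, clip(d1), clip(d2), clip(d3))
```

In a query-based attack, a projection like this could be applied to each candidate perturbation before it is added to the image, so that every query already satisfies the constraint.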
Key facts
- MMAS is a black-box multi-modal attack framework.
- It generates universal adversarial perturbations for images and text.
- Image perturbation uses wavelet-based texture constraints.
- Text perturbation is a learnable prompt.
- Optimization is done jointly using model queries.
- The attack targets vulnerabilities in LVLMs' multi-modal understanding.
- Applications at risk include autonomous driving and content moderation.
- Paper ID: arXiv:2605.26501.
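To make "joint optimization using model queries" concrete, here is a minimal toy sketch of a black-box random search that updates an image perturbation and a short prompt together, keeping a candidate only when the queried score improves. Everything in it is an assumption made for illustration: `toy_victim_score` is a synthetic stand-in for an LVLM query, `VOCAB` is an invented token pool, and random search is a swapped-in optimizer, not the procedure from the paper. For brevity it attacks a single image rather than learning a universal perturbation.

```python
import random

# Hypothetical token pool for the adversarial prompt (illustration only).
VOCAB = ["ignore", "describe", "sky", "stop", "road", "safe"]


def toy_victim_score(image, prompt):
    """Synthetic stand-in for one black-box model query.

    Returns the score the attacker wants to raise. A real attack would call
    the target model here and score its output instead.
    """
    pixel_term = sum(image) / len(image)
    text_term = sum(tok in ("ignore", "safe") for tok in prompt) / len(prompt)
    return pixel_term * (1.0 + text_term)  # both modalities contribute


def joint_random_search(image, prompt_len=3, eps=0.05, queries=200, seed=0):
    """Jointly search an image perturbation and a prompt using score queries only."""
    rng = random.Random(seed)
    delta = [0.0] * len(image)
    prompt = [rng.choice(VOCAB) for _ in range(prompt_len)]

    def score(d, p):
        # Apply the perturbation and keep pixels in the valid [0, 1] range.
        adv = [min(1.0, max(0.0, x + dx)) for x, dx in zip(image, d)]
        return toy_victim_score(adv, p)

    best = score(delta, prompt)
    for _ in range(queries):
        cand_delta, cand_prompt = list(delta), list(prompt)
        # Propose a change in both modalities at once.
        cand_delta[rng.randrange(len(image))] = rng.choice((-eps, eps))
        cand_prompt[rng.randrange(prompt_len)] = rng.choice(VOCAB)
        s = score(cand_delta, cand_prompt)
        if s > best:  # accept only improvements, so best never decreases
            delta, prompt, best = cand_delta, cand_prompt, s
    return delta, prompt, best


if __name__ == "__main__":
    clean = [0.2, 0.4, 0.6, 0.8]  # a flattened toy "image"
    d, p, s = joint_random_search(clean)
    print("perturbation:", d)
    print("prompt:", p)
    print("score:", s)
```

Each loop iteration costs one model query, so `queries` doubles as the attack's query budget, which is the main cost in a black-box setting.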
Entities
Institutions
- arXiv