ARTFEED — Contemporary Art Intelligence

New Framework Attacks Vision-Language Models via Dual Modalities

ai-technology · 2026-05-27

Researchers have introduced Multi-Modal Adversarial Synergy (MMAS), a black-box framework that attacks Large Vision-Language Models (LVLMs) by generating universal adversarial perturbations for images and text at the same time. The image perturbation is shaped by wavelet-based texture constraints, and the text perturbation takes the form of a learnable prompt; the two are optimized jointly using only queries to the target model. Existing attacks, by contrast, typically focus on a single modality or require white-box access. By targeting vulnerabilities in multi-modal understanding, the approach poses risks to applications such as autonomous driving and content moderation. The paper is available on arXiv under ID 2605.26501.
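To make "optimized jointly through model queries" concrete, here is a minimal Python sketch of a query-only random search over an image perturbation and a short token prompt. It is an illustration under stated assumptions, not the paper's algorithm: `toy_query` is a hypothetical stand-in for a black-box LVLM score, and `PATTERN`, `TARGET_IDS`, `VOCAB_SIZE`, the step size, and the accept-if-better rule are all invented for the example.

```python
import numpy as np

# Hypothetical stand-in for a black-box model: the attacker never reads these values.
_hidden = np.random.default_rng(0)
PATTERN = _hidden.standard_normal((8, 8))  # image direction the toy model reacts to
TARGET_IDS = np.array([3, 1, 4])           # token ids the toy model reacts to
VOCAB_SIZE = 8

def toy_query(image, prompt_ids):
    """Return a scalar attack score for one (image, prompt) query. Higher is better."""
    img_term = float(np.mean(image * PATTERN))
    txt_term = float(np.mean(prompt_ids == TARGET_IDS))
    return img_term + txt_term

def joint_random_search(image, eps=0.1, budget=200, seed=1):
    """Jointly update an image perturbation and a prompt using score queries only."""
    rng = np.random.default_rng(seed)
    delta = np.zeros_like(image)
    prompt = rng.integers(0, VOCAB_SIZE, size=len(TARGET_IDS))
    best = toy_query(image + delta, prompt)
    history = [best]
    for _ in range(budget):
        # Propose a change in BOTH modalities, then keep it only if the score improves.
        cand_delta = np.clip(delta + 0.02 * rng.standard_normal(image.shape), -eps, eps)
        cand_prompt = prompt.copy()
        cand_prompt[rng.integers(len(prompt))] = rng.integers(VOCAB_SIZE)
        score = toy_query(image + cand_delta, cand_prompt)
        if score > best:
            delta, prompt, best = cand_delta, cand_prompt, score
        history.append(best)
    return delta, prompt, history

image = np.zeros((8, 8))
delta, prompt, history = joint_random_search(image)
print(f"score before: {history[0]:.3f}, after: {history[-1]:.3f}")
```

The point of the sketch is the access pattern: the loop only calls `toy_query` and never touches gradients or weights, which is what distinguishes a black-box attack from a white-box one.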

Key facts

  • MMAS is a black-box multi-modal attack framework.
  • It generates universal adversarial perturbations for images and text.
  • Image perturbation uses wavelet-based texture constraints.
  • Text perturbation is a learnable prompt.
  • Optimization is done jointly using model queries.
  • The attack targets vulnerabilities in LVLMs' multi-modal understanding.
  • Applications at risk include autonomous driving and content moderation.
  • Paper ID: arXiv:2605.26501.
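
The summary does not say how the wavelet-based texture constraint is defined. One plausible reading, sketched below in Python as an assumption rather than the paper's method, is to keep only the high-frequency (detail) subbands of a single-level Haar transform so the perturbation changes fine texture but not coarse structure. The function names `haar2d`, `ihaar2d`, `texture_only`, and `constrain` are hypothetical.

```python
import numpy as np

S = np.sqrt(2.0)

def haar2d(x):
    """Single-level orthonormal 2D Haar transform (height and width must be even)."""
    a = (x[0::2, :] + x[1::2, :]) / S   # sums of vertically adjacent rows
    d = (x[0::2, :] - x[1::2, :]) / S   # differences of vertically adjacent rows
    ll = (a[:, 0::2] + a[:, 1::2]) / S  # coarse approximation
    lh = (a[:, 0::2] - a[:, 1::2]) / S  # detail subbands
    hl = (d[:, 0::2] + d[:, 1::2]) / S
    hh = (d[:, 0::2] - d[:, 1::2]) / S
    return ll, lh, hl, hh

def ihaar2d(ll, lh, hl, hh):
    """Inverse of haar2d."""
    h, w = ll.shape
    a = np.empty((h, 2 * w))
    d = np.empty((h, 2 * w))
    a[:, 0::2], a[:, 1::2] = (ll + lh) / S, (ll - lh) / S
    d[:, 0::2], d[:, 1::2] = (hl + hh) / S, (hl - hh) / S
    x = np.empty((2 * h, 2 * w))
    x[0::2, :], x[1::2, :] = (a + d) / S, (a - d) / S
    return x

def texture_only(delta):
    """Zero the coarse (LL) subband so only detail coefficients remain."""
    ll, lh, hl, hh = haar2d(delta)
    return ihaar2d(np.zeros_like(ll), lh, hl, hh)

def constrain(delta, eps):
    """Texture projection followed by a per-pixel magnitude bound.

    Note: clipping can reintroduce a small LL component; this sketch ignores that.
    """
    return np.clip(texture_only(delta), -eps, eps)

rng = np.random.default_rng(0)
delta = rng.standard_normal((8, 8))
projected = texture_only(delta)
print("max |LL| after projection:", float(np.abs(haar2d(projected)[0]).max()))
```

A production implementation would more likely use a wavelet library and a multi-level decomposition; the hand-written Haar transform is used here only to keep the example dependency-free.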

Entities

Institutions

  • arXiv

Sources