Universal Adversarial Attacks on Vision-Language Models: A Dual-Dimension Evaluation
A recent study posted to arXiv (2605.01449) challenges how success rates for universal adversarial attacks on aligned multimodal large language models are measured. Prior work has reported success rates of 60-80%, but the authors argue that this figure conflates two distinct phenomena: Influence (any modification of the model's output) and Precise Injection (delivery of the attacker's intended concept). Combining the Universal Adversarial Attack and AnyAttack methods under an L_inf budget of 16/255, they propose a dual-axis evaluation that pairs a deterministic Ratcliff-Obershelp drift score for Influence with a four-tier ordinal scale for Precise Injection. DeepSeek-V4-Pro in thinking mode serves as the judge, calibrated against Claude Opus 4.7 and achieving Cohen's κ = 0.77 on the injection axis, which indicates substantial agreement. The work aims to disentangle vulnerability metrics in vision-language models.
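The paper's scoring code is not reproduced here, but the Influence axis can be approximated with Python's standard library, since difflib.SequenceMatcher is based on Ratcliff-Obershelp pattern matching. A minimal sketch, assuming drift is defined as one minus the similarity between the clean and attacked responses (the paper's exact normalization is an assumption):

```python
from difflib import SequenceMatcher


def drift_score(clean_response: str, attacked_response: str) -> float:
    """Approximate Influence as textual drift between the model's clean
    and attacked outputs via Ratcliff-Obershelp similarity (difflib).

    Assumption: drift = 1 - similarity; autojunk is disabled to keep the
    score deterministic for long responses.
    """
    similarity = SequenceMatcher(
        None, clean_response, attacked_response, autojunk=False
    ).ratio()
    return 1.0 - similarity


# Identical outputs drift 0.0; unrelated outputs drift toward 1.0.
print(drift_score("The image shows a cat.", "The image shows a cat."))      # 0.0
print(drift_score("The image shows a cat.", "Visit evil.example.com now"))  # near 1.0
```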
Key facts
- arXiv paper 2605.01449 critiques universal adversarial attack success rates on multimodal LLMs.
- Argues that 60-80% success conflates Influence and Precise Injection.
- Combines Universal Adversarial Attack and AnyAttack under an L_inf budget of 16/255 (see the projection sketch after this list).
- Introduces dual-axis evaluation: Ratcliff-Obershelp drift score for Influence, 4-tier ordinal scale for Precise Injection.
- Uses DeepSeek-V4-Pro in thinking mode as judge.
- Calibrated against Claude Opus 4.7 with Cohen's κ = 0.77 on the injection axis (see the kappa sketch after this list).
- Focuses on disentangling vulnerability metrics in vision-language models.
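An L_inf budget of 16/255 means no pixel of the adversarial image may deviate from the clean image by more than 16/255. The paper's attack code is not public; the following is a generic projection sketch, assuming images are float arrays in [0, 1]:

```python
import numpy as np


def project_linf(adv_image: np.ndarray, clean_image: np.ndarray,
                 epsilon: float = 16 / 255) -> np.ndarray:
    """Project an adversarial image back into the L_inf ball of radius
    epsilon around the clean image, then clip to the valid pixel range.

    The 16/255 budget matches the paper; the projection itself is a
    standard step, not the authors' released code.
    """
    perturbation = np.clip(adv_image - clean_image, -epsilon, epsilon)
    return np.clip(clean_image + perturbation, 0.0, 1.0)


# Example: a large random perturbation is squashed so no pixel deviates
# from the clean image by more than 16/255.
clean = np.random.rand(3, 224, 224).astype(np.float32)
adv = clean + np.random.uniform(-0.5, 0.5, clean.shape).astype(np.float32)
projected = project_linf(adv, clean)
assert np.max(np.abs(projected - clean)) <= 16 / 255 + 1e-6
```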
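Judge calibration of the kind reported (κ = 0.77 between DeepSeek-V4-Pro and Claude Opus 4.7) can be computed with scikit-learn. A minimal sketch with hypothetical tier labels, since the paper's annotations are not released; unweighted kappa is assumed, as the paper does not specify a weighting:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical Precise Injection tiers (0-3) assigned by the primary judge
# and the calibration judge to the same attacked responses.
primary_judge = [3, 2, 0, 1, 3, 0, 2, 2, 1, 0]
calibration_judge = [3, 2, 0, 1, 2, 0, 2, 3, 1, 0]

# Values in the 0.61-0.80 band are conventionally read as substantial agreement.
kappa = cohen_kappa_score(primary_judge, calibration_judge)
print(f"Cohen's kappa: {kappa:.2f}")
```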
Entities
Institutions
- arXiv
- DeepSeek
- Anthropic (Claude)