Universal Adversarial Attacks on Vision-Language Models: A Dual-Dimension Evaluation
A recent study posted to arXiv (2605.01449) challenges how success rates for universal adversarial attacks on aligned multimodal large language models are measured. Prior work has reported success rates of 60-80%, but the authors argue that this figure conflates two distinct phenomena: Influence (any modification of the model's output) and Precise Injection (delivery of the attacker's intended concept). Combining the Universal Adversarial Attack and AnyAttack methods under an L_inf budget of 16/255, they propose a dual-axis evaluation that pairs a deterministic Ratcliff-Obershelp drift score for Influence with a four-tier ordinal scale for Precise Injection. DeepSeek-V4-Pro in thinking mode serves as the judge, calibrated against Claude Opus 4.7 and achieving Cohen's κ = 0.77 on the injection axis, which indicates substantial agreement. The work aims to disentangle vulnerability metrics in vision-language models.
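The paper's scoring code is not reproduced here, but the Influence axis can be approximated with Python's standard library, since difflib.SequenceMatcher is based on Ratcliff-Obershelp pattern matching. A minimal sketch, assuming drift is defined as one minus the similarity between the clean and attacked responses (the paper's exact normalization is an assumption):

```python
from difflib import SequenceMatcher


def drift_score(clean_response: str, attacked_response: str) -> float:
    """Approximate Influence as textual drift between the model's clean
    and attacked outputs via Ratcliff-Obershelp similarity (difflib).

    Assumption: drift = 1 - similarity; autojunk is disabled to keep the
    score deterministic for long responses.
    """
    similarity = SequenceMatcher(
        None, clean_response, attacked_response, autojunk=False
    ).ratio()
    return 1.0 - similarity


# Identical outputs drift 0.0; unrelated outputs drift toward 1.0.
print(drift_score("The image shows a cat.", "The image shows a cat."))      # 0.0
print(drift_score("The image shows a cat.", "Visit evil.example.com now"))  # near 1.0
```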
Key facts
- arXiv paper 2605.01449 critiques universal adversarial attack success rates on multimodal LLMs.
- Argues that 60-80% success conflates Influence and Precise Injection.
- Combines Universal Adversarial Attack and AnyAttack under an L_inf budget of 16/255 (see the projection sketch after this list).
- Introduces dual-axis evaluation: Ratcliff-Obershelp drift score for Influence, 4-tier ordinal scale for Precise Injection.
- Uses DeepSeek-V4-Pro in thinking mode as judge.
- Calibrated against Claude Opus 4.7 with Cohen's κ = 0.77 on the injection axis (see the kappa sketch after this list).
- Focuses on disentangling vulnerability metrics in vision-language models.
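An L_inf budget of 16/255 means no pixel of the adversarial image may deviate from the clean image by more than 16/255. The paper's attack code is not public; the following is a generic projection sketch, assuming images are float arrays in [0, 1]:

```python
import numpy as np


def project_linf(adv_image: np.ndarray, clean_image: np.ndarray,
                 epsilon: float = 16 / 255) -> np.ndarray:
    """Project an adversarial image back into the L_inf ball of radius
    epsilon around the clean image, then clip to the valid pixel range.

    The 16/255 budget matches the paper; the projection itself is a
    standard step, not the authors' released code.
    """
    perturbation = np.clip(adv_image - clean_image, -epsilon, epsilon)
    return np.clip(clean_image + perturbation, 0.0, 1.0)


# Example: a large random perturbation is squashed so no pixel deviates
# from the clean image by more than 16/255.
clean = np.random.rand(3, 224, 224).astype(np.float32)
adv = clean + np.random.uniform(-0.5, 0.5, clean.shape).astype(np.float32)
projected = project_linf(adv, clean)
assert np.max(np.abs(projected - clean)) <= 16 / 255 + 1e-6
```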
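Judge calibration of the kind reported (κ = 0.77 between DeepSeek-V4-Pro and Claude Opus 4.7) can be computed with scikit-learn. A minimal sketch with hypothetical tier labels, since the paper's annotations are not released; unweighted kappa is assumed, as the paper does not specify a weighting:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical Precise Injection tiers (0-3) assigned by the primary judge
# and the calibration judge to the same attacked responses.
primary_judge = [3, 2, 0, 1, 3, 0, 2, 2, 1, 0]
calibration_judge = [3, 2, 0, 1, 2, 0, 2, 3, 1, 0]

# Values in the 0.61-0.80 band are conventionally read as substantial agreement.
kappa = cohen_kappa_score(primary_judge, calibration_judge)
print(f"Cohen's kappa: {kappa:.2f}")
```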
Entities
Institutions
- arXiv
- DeepSeek
- Anthropic (Claude)