Multimodal Jailbreak Robustness in Vision-Language Models

ai-technology · 2026-05-28

A recent study, available as arXiv preprint 2605.27932, examines the safety of think-with-image reasoning in large vision-language models (VLMs). It compares four inference paradigms: direct response, a text-only prior turn, visual-state manipulation, and explicit image-tool invocation. Explicit image-tool interaction yields the lowest attack success rate (ASR), cutting jailbreak success by about 30% in relative terms on average across several VLMs. ASR stays low even when the tool's output is overridden or unsafe, but text-only prior-turn controls bring it back to near direct-answering levels. The lower ASR is therefore not explained by the content of the tool output, which suggests that how the interaction is structured matters for multimodal safety.

Key facts

Study examines think-with-image reasoning safety in VLMs.
Four inference paradigms compared: direct response, text-only prior turn, visual-state manipulation, explicit image-tool invocation.
Explicit image-tool interaction yields lowest ASR.
Jailbreak success falls by about 30% in relative terms, on average.
ASR remains low even when tool output is overridden or unsafe.
Text-only prior-turn controls bring ASR back to near direct-answering levels.
Lower ASR not explained by tool output content.
Research published on arXiv (2605.27932).
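
The "~30% relative" figure is easy to misread as a drop of 30 percentage points. The sketch below shows how a relative ASR reduction is typically computed. The function names and the outcome counts are hypothetical and chosen only for illustration; they are not data or code from the paper.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of jailbreak attempts judged successful."""
    if not outcomes:
        raise ValueError("need at least one attempt")
    return sum(outcomes) / len(outcomes)


def relative_reduction(baseline_asr: float, treated_asr: float) -> float:
    """Relative drop in ASR versus the baseline (0.30 means 30% lower)."""
    if baseline_asr <= 0:
        raise ValueError("baseline ASR must be positive")
    return (baseline_asr - treated_asr) / baseline_asr


# Hypothetical outcomes, NOT results from the paper: True = attack succeeded.
direct_answer = [True] * 50 + [False] * 50  # 50 of 100 attempts succeed
image_tool = [True] * 35 + [False] * 65     # 35 of 100 attempts succeed

baseline = attack_success_rate(direct_answer)
treated = attack_success_rate(image_tool)
print(f"relative reduction: {relative_reduction(baseline, treated):.0%}")
```

With these made-up counts, ASR falls from 50% to 35%: a 15-point absolute drop, which is a 30% relative reduction.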
