Visual Attacks Bypass Safety Alignment in Vision-Language Models
A recent paper on arXiv (2605.00583) argues that the visual modality of vision-language models (VLMs) is a significant but underexamined attack surface for bypassing safety alignment. The authors present four jailbreak strategies that exploit the visual channel: (1) encoding harmful instructions as sequences of visual symbols accompanied by a decoding key; (2) substituting a dangerous item with a benign one (e.g., bomb → banana) and prompting for harmful actions using the substitute; (3) replacing harmful text in images with innocuous words while preserving the original meaning; and (4) visual analogy puzzles that require the model to infer a forbidden concept. Across six frontier VLMs, these visual methods reliably bypass safety protocols, exposing a cross-modality alignment gap: text-based safety training does not generalize to visual representations of harmful intent. For example, the visual cipher achieves a 40.9% attack success rate against Claude-Haiku-4.5, versus 10.7% for text-only methods. The findings underscore the need for multimodal safety training that accounts for visual attack strategies.
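The first strategy, the visual cipher, can be illustrated with a minimal sketch: a substitution mapping from letters to symbols, where the inverse mapping serves as the decoding key supplied alongside the symbol sequence. The symbol alphabet and function names below are assumptions for illustration only; the paper's actual cipher construction may differ.

```python
import string

# Illustrative sketch of a visual-cipher-style encoding, NOT the
# paper's exact construction: map each letter to a symbol and ship
# the inverse mapping as the "decoding key" with the symbol sequence.
# The emoji alphabet here is an arbitrary choice for illustration.
SYMBOLS = [chr(0x1F600 + i) for i in range(26)]  # 26 consecutive emoji
KEY = dict(zip(string.ascii_lowercase, SYMBOLS))
REVERSE = {v: k for k, v in KEY.items()}

def encode(text: str) -> str:
    """Replace each letter with its symbol; keep spaces/punctuation."""
    return "".join(KEY.get(c, c) for c in text.lower())

def decode(symbols: str) -> str:
    """Invert the mapping using the decoding key."""
    return "".join(REVERSE.get(c, c) for c in symbols)
```

In the attack setting, the encoded symbol sequence would be rendered into an image, and the decoding key provided in the prompt, so that the harmful instruction never appears as plain text.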
Key facts
- Four jailbreak attacks exploit the visual modality of VLMs.
- Attacks include visual symbol sequences, benign substitutes, hidden text, and visual analogy puzzles.
- Evaluated across six frontier VLMs.
- Visual cipher achieves 40.9% attack success on Claude-Haiku-4.5.
- Text-only attack success on Claude-Haiku-4.5 is 10.7%.
- Cross-modality alignment gap: text safety training does not generalize to visual harmful intent.
- Study published on arXiv with ID 2605.00583.
- arXiv announce type: cross.
Entities
Institutions
- arXiv