ARTFEED — Contemporary Art Intelligence

Visual Attacks Bypass Safety Alignment in Vision-Language Models

ai-technology · 2026-05-04

A recent investigation published on arXiv (2605.00583) indicates that the visual aspect of vision-language models (VLMs) represents a significant yet underexamined vulnerability for circumventing safety measures. The researchers present four jailbreak strategies that leverage the visual element: encoding harmful directives into visual symbol sequences accompanied by a decoding key, substituting dangerous items with harmless alternatives (e.g., bomb → banana) and prompting harmful actions using the substitute, replacing harmful text in images with innocuous words while maintaining the original meaning, and visual analogy puzzles that necessitate deducing a forbidden concept. Testing across six advanced VLMs, these visual methods effectively bypass safety protocols, revealing a gap in cross-modality alignment where text-based safety training fails to extend to visual representations of harmful intent. For example, the visual cipher achieves a 40.9% success rate against Claude-Haiku-4.5, compared to 10.7% for text-only methods. This underscores the necessity for multimodal safety training that incorporates visual attack strategies.

Key facts

  • Four jailbreak attacks exploit the visual modality of VLMs.
  • Attacks include visual symbol sequences, benign substitutes, hidden text, and visual analogy puzzles.
  • Evaluated across six frontier VLMs.
  • Visual cipher achieves 40.9% attack success on Claude-Haiku-4.5.
  • Text-only attack success on Claude-Haiku-4.5 is 10.7%.
  • Cross-modality alignment gap: text safety training does not generalize to visual harmful intent.
  • Study published on arXiv with ID 2605.00583.
  • Announce type is cross.

Entities

Institutions

  • arXiv

Sources