DMN Framework Jailbreaks Multimodal LLMs with Multi-Image Inputs
Researchers propose DMN, a compositional jailbreak framework targeting multimodal large language models (MLLMs) that accept multi-image inputs. Unlike prior single-image methods, DMN distributes a harmful instruction across multiple images, uses multimodal evidence, and adds a number chain task to distract the model. In the reported experiments, attack success rates exceed 90% on GPT-4o, Gemini-2.5-pro, and Claude Sonnet 4. The paper highlights vulnerabilities arising from insufficient safety alignment for multi-image inputs.
Key facts
- DMN stands for Distributed instruction, Multimodal evidence, and Number chain task.
- Reports an attack success rate above 90% on GPT-4o, Gemini-2.5-pro, and Claude Sonnet 4.
- Exploits multi-image inputs to bypass safety alignment.
- Previous methods used only a single image, which limited the attack space.
- Published on arXiv with ID 2605.18915.
Entities
Institutions
- arXiv