Reconstruction-Concealment Tradeoff in MLLM Jailbreak Attacks
A new arXiv paper (2605.05709) identifies a tradeoff that, the authors argue, is fundamental to jailbreak attacks on multimodal large language models (MLLMs). Intent-obfuscation attacks transform harmful queries into concealed multimodal inputs in order to bypass safety filters. According to the study, these attacks are governed by a reconstruction-concealment tradeoff: the transformed input must hide its harmful intent while remaining recoverable by the model. An analysis of three representative black-box methods finds that existing transformations struggle to balance these two requirements, whereas character-removed variants strike a better balance. Building on this observation, the authors propose concealment-aware variant construction, which greedily selects diverse character-removed variants with low harmful-keyword alignment and is instantiated through five modality-aware prompting strategies.
Key facts
- Paper ID: arXiv:2605.05709
- Title: Conceal, Reconstruct, Jailbreak: Exploiting the Reconstruction-Concealment Tradeoff in MLLMs
- Focuses on intent-obfuscation-based jailbreak attacks on multimodal large language models (MLLMs)
- Identifies a reconstruction-concealment tradeoff governing such attacks
- Analyzes three representative black-box methods
- Finds existing transformations struggle to balance the tradeoff
- Finds that character-removed variants achieve a better balance
- Proposes concealment-aware variant construction with five modality-aware prompting strategies
Entities
Institutions
- arXiv