OrchJail: Fuzzing Framework Jailbreaks Tool-Calling T2I Agents
Researchers have introduced OrchJail, an orchestration-guided fuzzing framework for jailbreaking tool-calling text-to-image (T2I) agents. These agents can plan and execute complex multi-step tool chains for generation and editing, which creates a safety risk: individually benign steps can combine into an unsafe result. Prompt-only jailbreak methods are insufficient for this attack surface. OrchJail exploits high-risk orchestration patterns by learning from the tool-calling traces of successful jailbreaks and their causal relationships to prompt wording. Rather than relying on superficial text mutations, it steers the fuzzing search toward prompts that are more likely to trigger unsafe multi-step tool behaviors. Experiments indicate that OrchJail improves both jailbreak effectiveness and efficiency across representative T2I agents. The paper is available on arXiv under the identifier 2605.07414.
Key facts
- OrchJail is an orchestration-guided fuzzing framework for jailbreaking tool-calling T2I agents.
- Tool-calling T2I agents can plan and execute multi-step tool chains.
- Harmful outputs may arise from tool orchestration, where benign steps combine into unsafe results.
- Prompt-only jailbreak techniques are insufficient for this new attack surface.
- OrchJail exploits high-risk tool-orchestration patterns.
- It learns from the tool-calling traces of successful jailbreaks and their causal relationships to prompt wording.
- OrchJail guides fuzzing search toward prompts likely to trigger unsafe multi-step tool behaviors.
- Experiments show improved jailbreak effectiveness and efficiency across representative T2I agents.
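The guided-search idea in the facts above can be illustrated with a toy sketch. This is not OrchJail's implementation: the summary does not describe the framework's internals, so everything here is an assumption made for illustration. Counting adjacent tool pairs is a simple stand-in for whatever OrchJail learns from traces, and the names `learn_pattern_weights`, `guidance_score`, `stub_agent`, `guided_fuzz`, and the tool names in `TOOLS` are all hypothetical. The stub agent calls no real model; it only maps words in a prompt to abstract tool names.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Trace = Tuple[str, ...]  # ordered tool names one prompt caused the agent to call

# Hypothetical tool names for the stub agent below; not taken from the paper.
TOOLS = ("edit", "compose", "crop")


def learn_pattern_weights(flagged_traces: Sequence[Trace]) -> Counter:
    """Count adjacent tool pairs in traces that a safety check flagged."""
    weights: Counter = Counter()
    for trace in flagged_traces:
        weights.update(zip(trace, trace[1:]))
    return weights


def guidance_score(trace: Trace, weights: Counter) -> float:
    """Sum the learned weights of every adjacent tool pair in the trace."""
    return float(sum(weights[pair] for pair in zip(trace, trace[1:])))


def stub_agent(prompt: str) -> Trace:
    """Stand-in agent: calls 'generate', then each tool named in the prompt."""
    return ("generate",) + tuple(w for w in prompt.split() if w in TOOLS)


def guided_fuzz(
    seeds: Sequence[str],
    mutate: Callable[[str, random.Random], str],
    run_agent: Callable[[str], Trace],
    is_flagged: Callable[[Trace], bool],
    weights: Counter,
    budget: int,
    rng: random.Random,
) -> List[str]:
    """Greedy search: always mutate the prompt whose trace scores highest."""
    pool = [(guidance_score(run_agent(s), weights), s) for s in seeds]
    flagged: List[str] = []
    if not pool:
        return flagged
    for _ in range(budget):
        # Pick the candidate whose tool trace best matches learned patterns.
        _, parent = max(pool, key=lambda item: item[0])
        child = mutate(parent, rng)
        trace = run_agent(child)
        if is_flagged(trace):
            flagged.append(child)
        pool.append((guidance_score(trace, weights), child))
    return flagged


if __name__ == "__main__":
    weights = learn_pattern_weights([("generate", "edit", "compose")])
    hits = guided_fuzz(
        seeds=["draw"],
        mutate=lambda p, r: p + " " + r.choice(TOOLS),
        run_agent=stub_agent,
        is_flagged=lambda t: ("edit", "compose") in zip(t, t[1:]),
        weights=weights,
        budget=50,
        rng=random.Random(0),
    )
    print(len(hits), "flagged prompts")
```

The point of the sketch is the selection step: candidates are ranked by the tool sequence they induce rather than by their surface text, which is the contrast with prompt-only mutation drawn in the summary. A real harness would replace `stub_agent` and `is_flagged` with an actual agent and safety evaluator, and would likely need a less greedy selection rule.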
Entities
Institutions
- arXiv