MT-JailBench: Modular Benchmark for Multi-Turn Jailbreak Attacks on LLMs
MT-JailBench is a newly introduced benchmark for evaluating jailbreak attacks on large language models (LLMs) over multi-turn conversations. These attacks exploit the context that accumulates across a dialogue to steer a model toward unsafe outputs. Existing evaluations treat attacks as opaque black boxes and differ in turn budgets, judges, and retry rules, making it hard to tell whether reported gains come from better attack strategies or merely from different evaluation conditions. MT-JailBench decomposes each attack into five interoperable components: an evaluation function, an attack strategy, prompt generation, prompt refinement, and flow control. This design enables consistent comparison of attack methods and a clearer picture of multi-turn jailbreak vulnerabilities in LLMs.
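To make the five-component decomposition concrete, here is a minimal Python sketch of what the module interfaces could look like. All names here (Conversation, AttackStrategy, PromptGenerator, PromptRefiner, EvaluationFunction, and their methods) are hypothetical illustrations, not the paper's actual API.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Conversation:
    """Accumulated multi-turn dialogue state: (prompt, response) pairs."""
    turns: list[tuple[str, str]] = field(default_factory=list)


class AttackStrategy(Protocol):
    def plan(self, goal: str, conv: Conversation) -> str:
        """Choose the high-level tactic for the next turn."""
        ...


class PromptGenerator(Protocol):
    def generate(self, tactic: str, conv: Conversation) -> str:
        """Turn the chosen tactic into a concrete prompt."""
        ...


class PromptRefiner(Protocol):
    def refine(self, prompt: str, last_response: str | None) -> str:
        """Adjust the prompt in light of the model's previous reply."""
        ...


class EvaluationFunction(Protocol):
    def score(self, goal: str, response: str) -> float:
        """Judge how close a response comes to the unsafe goal."""
        ...
```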
Key facts
- MT-JailBench is a modular evaluation framework for multi-turn jailbreak attacks on LLMs.
- Multi-turn jailbreaks exploit conversational context accumulation to steer toward unsafe answers.
- Existing evaluations treat attacks as black boxes, with inconsistent turn budgets, judges, retry rules, and strategy generation.
- The framework implements each attack as five modules: evaluation function, attack strategy, prompt generation, prompt refinement, and flow control (see the orchestration sketch after this list).
- It enables fair comparison across attack methods and component-wise analysis.
- The research is published on arXiv with ID 2605.11002.
- The framework aims to standardize evaluation of multi-turn jailbreak attacks.
- The arXiv announcement type is cross-listed.
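A minimal sketch of how flow control might coordinate the other four modules, continuing the hypothetical interfaces above; run_attack, max_turns, success_threshold, and the loop shape are all assumptions, not the paper's API.

```python
# Continuing the hypothetical interfaces sketched earlier; every name and
# default value below is an illustrative assumption, not the paper's API.

def run_attack(goal, target_llm, strategy, generator, refiner, evaluator,
               max_turns=10, success_threshold=0.8):
    """Drive one multi-turn attack under a fixed turn budget (flow control)."""
    conv = Conversation()
    last_response = None
    for _ in range(max_turns):                          # fixed budget for fair comparison
        tactic = strategy.plan(goal, conv)              # attack strategy: pick a tactic
        prompt = generator.generate(tactic, conv)       # prompt generation: draft the turn
        prompt = refiner.refine(prompt, last_response)  # prompt refinement: use feedback
        last_response = target_llm(prompt)              # query the target model
        conv.turns.append((prompt, last_response))
        if evaluator.score(goal, last_response) >= success_threshold:
            return True, conv                           # evaluation function: judged jailbroken
    return False, conv                                  # budget exhausted without success
```

Holding the loop, judge, and turn budget fixed while swapping a single module is what would make the component-wise comparisons described above meaningful.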