LLM Jailbreak Metrics Questioned in New Study
A new preprint on arXiv (2605.14418), titled 'The Great Pretender: A Stochasticity Problem in LLM Jailbreak', challenges the reliability of Attack Success Rate (ASR) as a benchmark metric for LLM jailbreak attacks. The authors note that methods from reputable institutions, such as Anthropic's Best-of-N (BoN) or Microsoft Research's Crescendo, often report high ASR scores against industry-grade LLMs, but that these scores may not reflect real-world performance. For instance, a jailbreak prompt may achieve an 80% ASR on paper against a closed-source model with guardrails, yet succeed only 50% of the time (5 out of 10 attempts) against an open target model. The study argues that ASR is not a stable quantity, highlighting a stochasticity problem in both jailbreak creation and evaluation.
Key facts
- Preprint arXiv:2605.14418 questions LLM jailbreak metrics
- Attack Success Rate (ASR) found to be unstable
- Example: 80% reported ASR on paper vs 50% success (5 out of 10 attempts) in practice
- Methods from Anthropic (BoN) and Microsoft Research (Crescendo) cited
- Study focuses on stochasticity in jailbreak evaluation
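The instability described above can be illustrated with a small simulation (a hypothetical sketch, not code from the preprint): if each attack attempt succeeds independently with some fixed probability, the ASR measured over a handful of attempts varies considerably from evaluation run to evaluation run.

```python
import random

def measured_asr(p_success: float, n_attempts: int, seed: int = 0) -> float:
    """Empirical ASR over n independent attack attempts, each of which
    succeeds with probability p_success (a toy model of jailbreak trials)."""
    rng = random.Random(seed)
    hits = sum(rng.random() < p_success for _ in range(n_attempts))
    return hits / n_attempts

# A prompt whose true per-attempt success rate is 50% can report
# noticeably different ASR values on small evaluation sets.
for seed in range(5):
    print(f"run {seed}: measured ASR = {measured_asr(0.5, 10, seed=seed):.0%}")
```

With only 10 attempts per run, the binomial spread alone can move the measured ASR by tens of percentage points, which is consistent with the paper's point that a single reported ASR figure is not a stable quantity.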
Entities
Institutions
- Anthropic
- Microsoft Research
- arXiv