LLM Jailbreak Metrics Questioned in New Study
A new preprint on arXiv (2605.14418), titled 'The Great Pretender: A Stochasticity Problem in LLM Jailbreak', challenges the reliability of Attack Success Rate (ASR) as a benchmark metric for LLM jailbreak attacks. The authors note that methods from reputable institutions, such as Anthropic's Best-of-N (BoN) or Microsoft Research's Crescendo, often report high ASR scores against industry-grade LLMs, but that these scores may not reflect real-world performance. For instance, a jailbreak prompt may achieve an 80% ASR on paper against a closed-source model with guardrails, yet succeed only 50% of the time (5 out of 10 attempts) against an open target model. The study argues that ASR is not a stable quantity, highlighting a stochasticity problem in both jailbreak creation and evaluation.
Key facts
- Preprint arXiv:2605.14418 questions LLM jailbreak metrics
- Attack Success Rate (ASR) found to be unstable
- Example: 80% reported ASR on paper vs 50% success (5 out of 10 attempts) in practice
- Methods from Anthropic (BoN) and Microsoft Research (Crescendo) cited
- Study focuses on stochasticity in jailbreak evaluation
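The instability described above can be illustrated with a small simulation (a hypothetical sketch, not code from the preprint): if each attack attempt succeeds independently with some fixed probability, the ASR measured over a handful of attempts varies considerably from evaluation run to evaluation run.

```python
import random

def measured_asr(p_success: float, n_attempts: int, seed: int = 0) -> float:
    """Empirical ASR over n independent attack attempts, each of which
    succeeds with probability p_success (a toy model of jailbreak trials)."""
    rng = random.Random(seed)
    hits = sum(rng.random() < p_success for _ in range(n_attempts))
    return hits / n_attempts

# A prompt whose true per-attempt success rate is 50% can report
# noticeably different ASR values on small evaluation sets.
for seed in range(5):
    print(f"run {seed}: measured ASR = {measured_asr(0.5, 10, seed=seed):.0%}")
```

With only 10 attempts per run, the binomial spread alone can move the measured ASR by tens of percentage points, which is consistent with the paper's point that a single reported ASR figure is not a stable quantity.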
Entities
Institutions
- Anthropic
- Microsoft Research
- arXiv