ARTFEED — Contemporary Art Intelligence

LLM Jailbreak Metrics Questioned in New Study

other · 2026-05-16

A new preprint on arXiv (2605.14418), titled "The Great Pretender: A Stochasticity Problem in LLM Jailbreak," challenges the reliability of Attack Success Rate (ASR) as a benchmark metric for LLM jailbreak attacks. The authors observe that methods from prominent institutions, such as Anthropic's BoN and Microsoft Research's Crescendo, often report high ASR scores against industry-grade LLMs, but that these scores do not reflect real-world performance. For instance, a jailbreak prompt may achieve an 80% ASR on paper against a closed-source model with guardrails, yet succeed only 50% of the time (5 out of 10 attempts) against an open target model. The study argues that ASR is not a stable quantity, pointing to a stochasticity problem in both jailbreak creation and evaluation.
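To see why a measured ASR can be unstable, consider a minimal sketch (not the paper's methodology): if each jailbreak attempt independently succeeds with some true probability p, then an ASR measured over a small batch of attempts is just a binomial estimate, and it can swing widely from run to run. The probabilities and batch sizes below are illustrative assumptions.

```python
import random

def measured_asr(p: float, attempts: int, rng: random.Random) -> float:
    """Fraction of attempts that succeed, when each attempt
    independently succeeds with true probability p."""
    return sum(rng.random() < p for _ in range(attempts)) / attempts

# Hypothetical setup: a prompt whose true per-attempt success
# probability is 0.5, evaluated in batches of 10 attempts,
# repeated across 1000 independent evaluation runs.
rng = random.Random(0)
estimates = [measured_asr(0.5, 10, rng) for _ in range(1000)]

print(f"min={min(estimates):.1f}  max={max(estimates):.1f}  "
      f"mean={sum(estimates) / len(estimates):.3f}")
```

Even with a fixed true success rate of 50%, individual 10-attempt evaluations will report ASRs ranging from near 0% to near 100%, which is the kind of instability the study flags when single-run ASR figures are quoted as benchmark results.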

Key facts

  • Preprint arXiv:2605.14418 questions LLM jailbreak metrics
  • Attack Success Rate (ASR) found to be unstable
  • Example: 80% ASR on paper vs 50% success (5 of 10 attempts) in practice
  • Methods from Anthropic (BoN) and Microsoft Research (Crescendo) cited
  • Study focuses on stochasticity in jailbreak evaluation

Entities

Institutions

  • Anthropic
  • Microsoft Research
  • arXiv
