Survival Analysis Framework Quantifies LLM Safety Degradation Under Repeated Attacks
A recent paper posted to arXiv introduces a survival analysis framework for assessing how LLM vulnerability to jailbreaks evolves over time, moving beyond simple success-or-failure metrics. The study treats time-to-jailbreak as a survival outcome, which allows hazard functions, survival curves, and associated risk factors to be estimated. The authors evaluated three LLMs on a selection of HarmBench prompts across three attack categories and found distinct vulnerability profiles, with one model degrading rapidly under iterative attacks.
Key facts
- arXiv paper 2605.12869 proposes survival analysis for LLM safety evaluation.
- Framework models time-to-jailbreak as a survival outcome.
- Estimates hazard functions, survival curves, and risk factors.
- Evaluates three LLMs on HarmBench prompts across three attack categories.
- Models show distinct vulnerability profiles, with one degrading rapidly under iterative attacks.
- Existing frameworks report binary success/failure metrics, missing temporal dynamics.
- The work is preliminary and focuses on adversarial jailbreak attacks.
- LLMs remain vulnerable to attacks that circumvent safety guardrails.
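To make the time-to-jailbreak framing concrete, here is a minimal sketch of a Kaplan-Meier estimate in plain Python. It is an illustration of the general technique, not the paper's code: the summary does not say which estimator the authors used, and the function name `kaplan_meier`, the unit of time (attack rounds), and the sample data are all assumptions made for this example. Each trial records the round at which a jailbreak succeeded, or the round at which the attack budget ran out with the model still holding (a censored observation).

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Kaplan-Meier estimate for time-to-jailbreak data.

    durations: attack round at which each trial ended (hypothetical unit).
    events:    True if the trial ended in a jailbreak, False if it was
               censored (attack budget exhausted, model not jailbroken).
    Returns a list of (round, hazard, survival) tuples, one per round in
    which at least one jailbreak occurred.
    """
    if len(durations) != len(events):
        raise ValueError("durations and events must have the same length")

    jailbreaks = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)  # jailbreaks plus censored trials per round

    at_risk = len(durations)  # trials still unbroken before the first round
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = jailbreaks.get(t, 0)
        if d:
            hazard = d / at_risk      # discrete hazard at round t
            survival *= 1.0 - hazard  # product-limit update
            curve.append((t, hazard, survival))
        at_risk -= exits[t]           # drop everyone who left at round t
    return curve


# Made-up data for illustration only: six attack trials against one model.
rounds = [1, 2, 2, 3, 5, 5]
jailbroken = [True, True, False, True, False, False]

for t, hazard, survival in kaplan_meier(rounds, jailbroken):
    print(f"round {t}: hazard={hazard:.3f} survival={survival:.3f}")
```

A binary metric would summarize these six trials as three successes out of six. The survival curve additionally shows when the successes happen and keeps the censored trials in the risk set until they drop out, which is the temporal information the key facts say existing frameworks miss.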
Entities
Institutions
- arXiv
- HarmBench