Local Causal Explanations for LLM Jailbreak Success
A recent paper on arXiv (2605.00123) introduces local causal explanations for why jailbreaks succeed against large language models (LLMs). Whereas previous studies generally attribute jailbreak success to a decrease in harmfulness or refusal concepts within intermediate representations, this work contends that different jailbreak techniques can succeed by either enhancing or diminishing specific concepts, and that no single strategy applies uniformly across categories of harmful requests, such as violence versus cyberattacks. The authors aim to explain why particular jailbreaks succeed for particular requests, arguing that this finer-grained understanding is needed to safeguard future autonomous frontier models in high-stakes settings.
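To make the idea of a "concept within intermediate representations" concrete, below is a minimal sketch of one standard approach from this line of work: estimating a concept direction as a difference of mean hidden states, then comparing how strongly a plain versus a jailbroken request projects onto it. This is a generic illustration under stated assumptions, not the paper's method; the toy data and function names are hypothetical.

```python
# Sketch: difference-in-means concept direction and projection measurement.
# The prompt sets and activations are toy stand-ins, not the paper's data.
import numpy as np

def concept_direction(h_pos: np.ndarray, h_neg: np.ndarray) -> np.ndarray:
    """Unit direction from the difference of mean activations.
    h_pos, h_neg: (n_prompts, d_model) hidden states at one layer."""
    d = h_pos.mean(axis=0) - h_neg.mean(axis=0)
    return d / np.linalg.norm(d)

def projection(h: np.ndarray, direction: np.ndarray) -> float:
    """Scalar strength of the concept in a single activation vector."""
    return float(h @ direction)

# Toy hidden states at some intermediate layer (d_model = 8).
rng = np.random.default_rng(0)
h_refused = rng.normal(1.0, 0.1, size=(16, 8))   # prompts the model refuses
h_complied = rng.normal(0.0, 0.1, size=(16, 8))  # benign prompts it answers
refusal_dir = concept_direction(h_refused, h_complied)

h_plain = rng.normal(1.0, 0.1, size=8)       # harmful request, no jailbreak
h_jailbroken = rng.normal(0.4, 0.1, size=8)  # same request, jailbreak applied
print(projection(h_plain, refusal_dir), projection(h_jailbroken, refusal_dir))
```

A drop in the projection for the jailbroken prompt would match the "diminished concept" story; the paper's point is that some jailbreaks instead raise other concepts, so a single such measurement can mislead.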
Key facts
- Paper ID: arXiv:2605.00123
- Type: New announcement
- Focus: Jailbreak success in safety-trained LLMs
- Critiques global explanations as insufficient
- Proposes local explanations per jailbreak strategy and request category
- Distinguishes between violence and cyberattack request categories
- Motivation: future autonomous models in high-stakes settings
- Prior work used intermediate representations to identify causal directions (see the sketch after this list)
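The last bullet refers to testing whether a direction in activation space is causal for a behavior, typically by intervening on it and observing whether the output changes. Below is a minimal sketch, assuming PyTorch-style tensors, of two standard interventions: ablating a direction (projecting it out) and steering along it. The tensors and the "refusal" direction here are illustrative assumptions, not taken from the paper.

```python
# Sketch: causal interventions on a concept direction at one layer.
import torch

def ablate_direction(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of `hidden` along `direction` (causal ablation).
    hidden: (batch, d_model); direction: (d_model,)."""
    d = direction / direction.norm()
    return hidden - (hidden @ d).unsqueeze(-1) * d

def steer(hidden: torch.Tensor, direction: torch.Tensor, alpha: float) -> torch.Tensor:
    """Add `alpha` units of the concept direction (positive or negative steering)."""
    d = direction / direction.norm()
    return hidden + alpha * d

# Toy check: after ablation, activations have ~zero projection on the direction.
h = torch.randn(4, 16)          # (batch, d_model) activations at one layer
refusal_dir = torch.randn(16)   # hypothetical concept direction
h_ablated = ablate_direction(h, refusal_dir)
print((h_ablated @ (refusal_dir / refusal_dir.norm())).abs().max())  # ~0
```

In practice such edits are applied inside the model (e.g., via forward hooks) so that downstream behavior, such as whether the model refuses, can be observed after the intervention.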