Local Causal Explanations for LLM Jailbreak Success
A recent paper on arXiv (2605.00123) introduces local causal explanations for why jailbreaks succeed against large language models (LLMs). Whereas previous studies generally attribute jailbreak success to a decrease in harmfulness or refusal concepts within intermediate representations, this work contends that different jailbreak techniques can succeed by either enhancing or diminishing specific concepts, and that no single strategy applies uniformly across categories of harmful requests, such as violence versus cyberattacks. The authors aim to explain why particular jailbreaks succeed for particular requests, arguing that this finer-grained understanding is needed to safeguard future autonomous frontier models in high-stakes settings.
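To make the idea of a "concept within intermediate representations" concrete, below is a minimal sketch of one standard approach from this line of work: estimating a concept direction as a difference of mean hidden states, then comparing how strongly a plain versus a jailbroken request projects onto it. This is a generic illustration under stated assumptions, not the paper's method; the toy data and function names are hypothetical.

```python
# Sketch: difference-in-means concept direction and projection measurement.
# The prompt sets and activations are toy stand-ins, not the paper's data.
import numpy as np

def concept_direction(h_pos: np.ndarray, h_neg: np.ndarray) -> np.ndarray:
    """Unit direction from the difference of mean activations.
    h_pos, h_neg: (n_prompts, d_model) hidden states at one layer."""
    d = h_pos.mean(axis=0) - h_neg.mean(axis=0)
    return d / np.linalg.norm(d)

def projection(h: np.ndarray, direction: np.ndarray) -> float:
    """Scalar strength of the concept in a single activation vector."""
    return float(h @ direction)

# Toy hidden states at some intermediate layer (d_model = 8).
rng = np.random.default_rng(0)
h_refused = rng.normal(1.0, 0.1, size=(16, 8))   # prompts the model refuses
h_complied = rng.normal(0.0, 0.1, size=(16, 8))  # benign prompts it answers
refusal_dir = concept_direction(h_refused, h_complied)

h_plain = rng.normal(1.0, 0.1, size=8)       # harmful request, no jailbreak
h_jailbroken = rng.normal(0.4, 0.1, size=8)  # same request, jailbreak applied
print(projection(h_plain, refusal_dir), projection(h_jailbroken, refusal_dir))
```

A drop in the projection for the jailbroken prompt would match the "diminished concept" story; the paper's point is that some jailbreaks instead raise other concepts, so a single such measurement can mislead.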
Key facts
- Paper ID: arXiv:2605.00123
- Type: New announcement
- Focus: Jailbreak success in safety-trained LLMs
- Critiques global explanations as insufficient
- Proposes local explanations per jailbreak strategy and request category
- Distinguishes between violence and cyberattack request categories
- Motivation: future autonomous models in high-stakes settings
- Prior work used intermediate representations to identify causal directions (see the sketch after this list)
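The last bullet refers to testing whether a direction in activation space is causal for a behavior, typically by intervening on it and observing whether the output changes. Below is a minimal sketch, assuming PyTorch-style tensors, of two standard interventions: ablating a direction (projecting it out) and steering along it. The tensors and the "refusal" direction here are illustrative assumptions, not taken from the paper.

```python
# Sketch: causal interventions on a concept direction at one layer.
import torch

def ablate_direction(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of `hidden` along `direction` (causal ablation).
    hidden: (batch, d_model); direction: (d_model,)."""
    d = direction / direction.norm()
    return hidden - (hidden @ d).unsqueeze(-1) * d

def steer(hidden: torch.Tensor, direction: torch.Tensor, alpha: float) -> torch.Tensor:
    """Add `alpha` units of the concept direction (positive or negative steering)."""
    d = direction / direction.norm()
    return hidden + alpha * d

# Toy check: after ablation, activations have ~zero projection on the direction.
h = torch.randn(4, 16)          # (batch, d_model) activations at one layer
refusal_dir = torch.randn(16)   # hypothetical concept direction
h_ablated = ablate_direction(h, refusal_dir)
print((h_ablated @ (refusal_dir / refusal_dir.norm())).abs().max())  # ~0
```

In practice such edits are applied inside the model (e.g., via forward hooks) so that downstream behavior, such as whether the model refuses, can be observed after the intervention.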