Advanced Jailbreaks Preserve Frontier AI Model Capabilities
A new study on arXiv finds that the most sophisticated jailbreaks impose negligible performance degradation on the most capable language models. Researchers evaluated 28 jailbreaks across five benchmarks on Claude models ranging from Haiku 4.5 to Opus 4.6. Haiku 4.5 lost an average of 33.1% of benchmark performance when jailbroken, while Opus 4.6 at maximum thinking effort lost only 7.7%. Reasoning-heavy tasks showed more degradation than knowledge-recall tasks. Boundary Point Jailbreaking achieved near-perfect classifier evasion with near-zero capability degradation. The findings challenge the notion of a universal "jailbreak tax" (the performance a model sacrifices when answering through a jailbreak) and suggest that the tax diminishes as model capability increases.
Key facts
- Study on arXiv: 2605.00267
- 28 jailbreaks evaluated on five benchmarks
- Claude models tested: Haiku 4.5 to Opus 4.6
- Haiku 4.5 lost an average of 33.1% of benchmark performance when jailbroken
- Opus 4.6 lost only 7.7% at max thinking effort
- Reasoning-heavy tasks degraded more than knowledge-recall tasks
- Boundary Point Jailbreaking achieved near-perfect evasion with near-zero degradation
- Jailbreak tax scales inversely with model capability
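The "tax" figures above are relative drops in benchmark accuracy. A minimal sketch of how such a metric can be computed (the function name and the accuracy numbers below are illustrative assumptions, not values from the study):

```python
def jailbreak_tax(baseline_acc: float, jailbroken_acc: float) -> float:
    """Fraction of baseline benchmark accuracy lost when the model
    answers under a jailbreak (illustrative definition)."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (baseline_acc - jailbroken_acc) / baseline_acc

# Assumed example: a model with 80% baseline accuracy that keeps
# only 66.9% of that accuracy under a jailbreak pays a 33.1% tax.
tax = jailbreak_tax(0.80, 0.80 * (1 - 0.331))
print(round(tax, 3))  # → 0.331
```

Under this definition, a smaller tax at higher capability (7.7% for Opus 4.6 versus 33.1% for Haiku 4.5) is what the study describes as the tax scaling inversely with capability.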
Entities
Institutions
- arXiv