Advanced Jailbreaks Preserve Frontier AI Model Capabilities
A new study on arXiv finds that the most sophisticated jailbreaks impose negligible performance degradation on the most capable language models. Researchers evaluated 28 jailbreaks across five benchmarks on Claude models ranging from Haiku 4.5 to Opus 4.6. Haiku 4.5 lost an average of 33.1% of benchmark performance when jailbroken, while Opus 4.6 at maximum thinking effort lost only 7.7%. Reasoning-heavy tasks showed more degradation than knowledge-recall tasks. Boundary Point Jailbreaking achieved near-perfect classifier evasion with near-zero capability degradation. The findings challenge the notion of a universal "jailbreak tax" (the performance a model sacrifices when answering through a jailbreak) and suggest that the tax diminishes as model capability increases.
Key facts
- Study on arXiv: 2605.00267
- 28 jailbreaks evaluated on five benchmarks
- Claude models tested: Haiku 4.5 to Opus 4.6
- Haiku 4.5 lost an average of 33.1% of benchmark performance when jailbroken
- Opus 4.6 lost only 7.7% at max thinking effort
- Reasoning-heavy tasks degraded more than knowledge-recall tasks
- Boundary Point Jailbreaking achieved near-perfect evasion with near-zero degradation
- Jailbreak tax scales inversely with model capability
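The "tax" figures above are relative drops in benchmark accuracy. A minimal sketch of how such a metric can be computed (the function name and the accuracy numbers below are illustrative assumptions, not values from the study):

```python
def jailbreak_tax(baseline_acc: float, jailbroken_acc: float) -> float:
    """Fraction of baseline benchmark accuracy lost when the model
    answers under a jailbreak (illustrative definition)."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (baseline_acc - jailbroken_acc) / baseline_acc

# Assumed example: a model with 80% baseline accuracy that keeps
# only 66.9% of that accuracy under a jailbreak pays a 33.1% tax.
tax = jailbreak_tax(0.80, 0.80 * (1 - 0.331))
print(round(tax, 3))  # → 0.331
```

Under this definition, a smaller tax at higher capability (7.7% for Opus 4.6 versus 33.1% for Haiku 4.5) is what the study describes as the tax scaling inversely with capability.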
Entities
Institutions
- arXiv