FlipAttack Jailbreaks Black-Box LLMs with 98% Success Rate on GPT-4o

ai-technology · 2026-05-18

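Researchers propose FlipAttack, a simple jailbreak method that exploits a weakness in how autoregressive LLMs read text from left to right: they struggle to understand a prompt when noise is placed on its left side. FlipAttack builds that noise from the harmful prompt itself by flipping its text, using four flipping modes. The authors report about 98% attack success on GPT-4o in a single query, and the attack also works against other models. They describe the method as universal and stealthy, and it needs only black-box query access, with no access to model internals.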