FlipAttack Jailbreaks Black-Box LLMs with 98% Success Rate on GPT-4o
Researchers propose FlipAttack, a simple jailbreak method that exploits a weakness in how autoregressive LLMs comprehend text from left to right. The attack disguises a harmful prompt by adding noise to its left side, using four text-flipping modes. It reportedly reaches an attack success rate of about 98% on GPT-4o with a single query, and it was tested on 8 LLMs. The authors describe the method as universal and stealthy. Because it targets black-box models, it needs only query access and no access to model internals.
Key facts
- FlipAttack targets black-box LLMs
- Exploits autoregressive left-to-right text understanding
- Uses left-side noise and four flipping modes
- Achieves ~98% attack success rate on GPT-4o
- Requires only one query
- Tested on 8 LLMs
- Described by the authors as universal and stealthy
- Published on arXiv: 2410.02832
Entities
Institutions
- arXiv