ARTFEED — Contemporary Art Intelligence

FlipAttack Jailbreaks Black-Box LLMs with ~98% Success Rate on GPT-4o

ai-technology · 2026-05-18

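Researchers propose FlipAttack, a simple jailbreak method against black-box LLMs. It exploits a weakness of autoregressive models: they read text left to right and struggle to understand a prompt when noise is placed on its left side. FlipAttack disguises a harmful prompt by flipping its text in one of four modes, which creates that left-side noise from the prompt itself. The attack reports a success rate of about 98% on GPT-4o with a single query and was tested on 8 LLMs. The authors describe the method as universal and stealthy, and it requires no access to model internals.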

Key facts

  • FlipAttack targets black-box LLMs
  • Exploits autoregressive left-to-right text understanding
  • Uses left-side noise and four flipping modes
  • Achieves ~98% attack success rate on GPT-4o
  • Requires only one query
  • Tested on 8 LLMs
  • Method is universal and stealthy
  • Published on arXiv: 2410.02832
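
To make "flipping" concrete, here is a minimal sketch of three plain text transformations applied to a harmless sentence. This brief does not enumerate the four modes, so the three shown here (reversing word order, reversing characters within each word, reversing the whole string) are an assumption about what flipping could mean. The function names are hypothetical, and this is not the authors' implementation.

```python
# Illustrative only: the mode definitions and names below are assumptions,
# not taken from the FlipAttack code. The input is a benign sentence.

def flip_word_order(text: str) -> str:
    """Reverse the order of whitespace-separated words."""
    return " ".join(reversed(text.split()))


def flip_chars_in_word(text: str) -> str:
    """Reverse the characters inside each word, keeping word order."""
    return " ".join(word[::-1] for word in text.split())


def flip_chars_in_sentence(text: str) -> str:
    """Reverse every character in the string."""
    return text[::-1]


sample = "the quick brown fox"
print(flip_word_order(sample))         # fox brown quick the
print(flip_chars_in_word(sample))      # eht kciuq nworb xof
print(flip_chars_in_sentence(sample))  # xof nworb kciuq eht
```

For single-spaced text, each transformation is its own inverse, so applying it a second time restores the original sentence. In every case the start of the flipped string no longer resembles the start of the original, which is the left-side disruption the summary refers to.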

Entities

Institutions

  • arXiv

Sources