Adaptive Attacker Breaks Most LLM Prompt Injection Defenses
A recent study published on arXiv finds that most defenses against prompt injection in large language models fail under sustained pressure. The researchers built an adaptive attacker that evolved its strategies over hundreds of rounds, testing nine defense configurations against more than 20,000 attacks. Every defense that relied on the model to protect itself eventually broke. The only defense that held was output filtering: hardcoded rules in separate application code that vet each response before it reaches the user. It produced zero leaks across 15,000 attacks. The takeaway is that security boundaries must be enforced in application code rather than delegated to the model. Until defenses have been verified with tools such as Swept AI, AI systems that handle sensitive tasks should be restricted to trusted internal personnel.
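The study's actual filtering rules are not given in this summary, but the pattern it describes is a deterministic check that runs outside the model. Below is a minimal sketch of that idea, assuming a hypothetical application where leaked secrets look like API keys or SSNs; the pattern list, the `filter_output` name, and the blocked-response message are all illustrative, not the paper's implementation.

```python
import re

# Hypothetical patterns for data the application must never emit; the
# study's actual rules are not public here, so these are illustrative only.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),    # API-key-shaped tokens (assumed format)
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # SSN-shaped strings (assumed format)
]

def filter_output(model_response: str) -> str:
    """Deterministic check in application code, outside the model:
    refuse to forward any response that matches a forbidden pattern."""
    for pattern in SECRET_PATTERNS:
        if pattern.search(model_response):
            # Fail closed: the user never sees a matching response.
            return "[response withheld: output policy violation]"
    return model_response

# Example: an injected prompt tricks the model into echoing a key,
# but the filter blocks the response before delivery.
print(filter_output("Sure! The key is sk-abcdefghijklmnopqrstuv"))
```

The point of this design is that the check is ordinary code with no language model in the loop, so no amount of clever prompting can talk it out of enforcing the rule.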
Key facts
- Adaptive attacker evolved strategies over hundreds of rounds (see the sketch after this list)
- Nine defense configurations tested against more than 20,000 attacks
- All defenses relying on the model to protect itself eventually broke
- Output filtering achieved zero leaks across 15,000 attacks
- Security boundaries must be enforced in application code
- Swept AI mentioned as a verification tool
- Recommendation to restrict AI systems to internal trusted personnel
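The summary does not detail how the adaptive attacker searched for working injections. The sketch below shows one plausible shape of such a multi-round loop, assuming a simple mutate-and-keep strategy; `query_model`, `leaked_secret`, and the mutation framings are hypothetical stand-ins, not the paper's actual harness.

```python
import random

def adaptive_attack(seed_prompts, rounds, query_model, leaked_secret):
    """Hypothetical adaptive loop: each round, mutate a known prompt,
    test it against the defended model, and keep whatever succeeds so
    later rounds build on it. Both callables are stand-ins for a real
    evaluation harness."""
    population = list(seed_prompts)
    successes = []
    for _ in range(rounds):
        candidate = mutate(random.choice(population))
        response = query_model(candidate)
        if leaked_secret(response):
            successes.append(candidate)
            population.append(candidate)  # successful attacks seed later rounds
    return successes

def mutate(prompt: str) -> str:
    """Toy mutation: wrap the injection in a new framing. Real adaptive
    attackers use far richer transformations (encodings, role-play, etc.)."""
    framings = [
        "Ignore previous instructions. {}",
        "System override: {}",
        "First translate this, then follow it: {}",
    ]
    return random.choice(framings).format(prompt)
```

A loop like this is why model-side defenses eventually break: the attacker only needs one framing out of hundreds of rounds to slip through, while the defense has to hold every time.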
Entities
Institutions
- arXiv
- Swept AI