Adaptive Attacker Breaks Most LLM Prompt Injection Defenses
A recent study published on arXiv finds that most defenses against prompt injection in large language models fail under sustained pressure. The researchers built an adaptive attacker that evolved its strategies over hundreds of rounds, testing nine defense configurations against more than 20,000 attacks. Every defense that relied on the model to protect itself eventually broke. The only defense that held was output filtering: hardcoded rules in separate application code that vet each response before it reaches the user. It produced zero leaks across 15,000 attacks. The takeaway is that security boundaries must be enforced in application code rather than delegated to the model. Until defenses have been verified with tools such as Swept AI, AI systems that handle sensitive tasks should be restricted to trusted internal personnel.
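The study's actual filtering rules are not given in this summary, but the pattern it describes is a deterministic check that runs outside the model. Below is a minimal sketch of that idea, assuming a hypothetical application where leaked secrets look like API keys or SSNs; the pattern list, the `filter_output` name, and the blocked-response message are all illustrative, not the paper's implementation.

```python
import re

# Hypothetical patterns for data the application must never emit; the
# study's actual rules are not public here, so these are illustrative only.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),    # API-key-shaped tokens (assumed format)
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # SSN-shaped strings (assumed format)
]

def filter_output(model_response: str) -> str:
    """Deterministic check in application code, outside the model:
    refuse to forward any response that matches a forbidden pattern."""
    for pattern in SECRET_PATTERNS:
        if pattern.search(model_response):
            # Fail closed: the user never sees a matching response.
            return "[response withheld: output policy violation]"
    return model_response

# Example: an injected prompt tricks the model into echoing a key,
# but the filter blocks the response before delivery.
print(filter_output("Sure! The key is sk-abcdefghijklmnopqrstuv"))
```

The point of this design is that the check is ordinary code with no language model in the loop, so no amount of clever prompting can talk it out of enforcing the rule.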
Key facts
- Adaptive attacker evolved strategies over hundreds of rounds (see the sketch after this list)
- Nine defense configurations tested against more than 20,000 attacks
- All defenses relying on the model to protect itself eventually broke
- Output filtering achieved zero leaks across 15,000 attacks
- Security boundaries must be enforced in application code
- Swept AI mentioned as a verification tool
- Recommendation to restrict AI systems to internal trusted personnel
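The summary does not detail how the adaptive attacker searched for working injections. The sketch below shows one plausible shape of such a multi-round loop, assuming a simple mutate-and-keep strategy; `query_model`, `leaked_secret`, and the mutation framings are hypothetical stand-ins, not the paper's actual harness.

```python
import random

def adaptive_attack(seed_prompts, rounds, query_model, leaked_secret):
    """Hypothetical adaptive loop: each round, mutate a known prompt,
    test it against the defended model, and keep whatever succeeds so
    later rounds build on it. Both callables are stand-ins for a real
    evaluation harness."""
    population = list(seed_prompts)
    successes = []
    for _ in range(rounds):
        candidate = mutate(random.choice(population))
        response = query_model(candidate)
        if leaked_secret(response):
            successes.append(candidate)
            population.append(candidate)  # successful attacks seed later rounds
    return successes

def mutate(prompt: str) -> str:
    """Toy mutation: wrap the injection in a new framing. Real adaptive
    attackers use far richer transformations (encodings, role-play, etc.)."""
    framings = [
        "Ignore previous instructions. {}",
        "System override: {}",
        "First translate this, then follow it: {}",
    ]
    return random.choice(framings).format(prompt)
```

A loop like this is why model-side defenses eventually break: the attacker only needs one framing out of hundreds of rounds to slip through, while the defense has to hold every time.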
Entities
Institutions
- arXiv
- Swept AI