OEP Attack Poisons Self-Evolving LLM Agents via Clean Experiences
Researchers have identified a new security vulnerability in memory-augmented large language model (LLM) agents that rely on iterative reflection and self-evolution. The attack, named Obsessive Experience Poisoning (OEP), exploits the agent's ability to generate experiences and learn from them. Unlike previous attacks, which require privileged access or explicit malicious content, OEP is a low-privilege black-box attack built from adversarial but clean edge cases. Each edge case pairs a locally correct solution with a severe but plausible hypothetical consequence, which leads the agent to generalize harmfully during reflection. Because the attack needs no direct control over the system prompt or memory database, it is stealthy and difficult to detect. The findings were published on arXiv under the identifier 2605.18930.
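To make the mechanism concrete, here is a minimal toy sketch in Python of why a clean experience can slip past an explicit-content check and still distort what an agent remembers. It is not the paper's implementation. Every name in it (Experience, BLOCKLIST, reflect, ToyAgent) is hypothetical, the reflection step is a string template standing in for an LLM call, and the harm is modeled as over-refusal, which is an assumption made only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Experience:
    task: str
    solution: str          # locally correct answer to the task
    consequence_note: str  # severe but plausible hypothetical attached to it


# Hypothetical keyword filter standing in for an explicit-malicious-content check.
BLOCKLIST = ("ignore previous instructions", "exfiltrate", "rm -rf /")


def passes_content_filter(exp: Experience) -> bool:
    text = f"{exp.task} {exp.solution} {exp.consequence_note}".lower()
    return not any(bad in text for bad in BLOCKLIST)


def reflect(exp: Experience) -> str:
    """Toy stand-in for an LLM reflection step (assumption, not the paper's method).

    It turns the consequence note into a blanket rule about the task's first
    word, mimicking the harmful over-generalization described above.
    """
    action = exp.task.split()[0].lower()
    return f"avoid '{action}' tasks: {exp.consequence_note}"


@dataclass
class ToyAgent:
    memory: list[str] = field(default_factory=list)

    def learn(self, exp: Experience) -> bool:
        # The agent writes its own memory; the attacker only supplies the task.
        if not passes_content_filter(exp):
            return False
        self.memory.append(reflect(exp))
        return True

    def will_attempt(self, task: str) -> bool:
        action = task.split()[0].lower()
        return not any(f"'{action}'" in rule for rule in self.memory)


agent = ToyAgent()
edge_case = Experience(
    task="delete stale cache files in /tmp/app",
    solution="find /tmp/app -name '*.cache' -mtime +7 -delete",
    consequence_note="a wrong path here could have destroyed production data",
)
before = agent.will_attempt("delete old log files")
accepted = agent.learn(edge_case)
after = agent.will_attempt("delete old log files")
print(before, accepted, after)
```

In this toy run the edge case contains nothing a keyword filter would flag, yet the stored rule makes the agent decline a later, unrelated "delete" task. A real self-evolving agent would produce its reflection with an LLM, so the generalization would be less mechanical than this template.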
Key facts
- OEP is a low-privilege black-box attack on self-evolving LLM agents.
- The attack uses clean experiences that are locally correct but induce harmful generalization.
- It requires no direct control over system prompt or memory database.
- The attack exploits iterative reflection and self-evolution mechanisms.
- Previous attacks required privileged access or explicit malicious content.
- The paper is available on arXiv under identifier 2605.18930.
- The attack combines locally correct solutions with hypothetical consequences.
- It targets memory-augmented LLM agents.
Entities
Institutions
- arXiv