Sleeper Attack: Persistent Adversarial Threats to LLM Agents
A recent study published on arXiv (2605.28201) presents the concept of 'Sleeper Attack,' a safety concern where harmful content lingers through various interactions with Large Language Model (LLM) agents. In contrast to attacks that manifest harmful behavior in a single interaction, this adversarial content can lie inactive within the agent's state and be activated by a harmless user prompt at a later time. The researchers developed a benchmark consisting of 1,896 examples that illustrate six real-world harmful consequences to assess this risk. This research underscores a new vulnerability in LLM agents, complicating efforts for detection and mitigation.
Key facts
- Sleeper Attack is a persistent adversarial threat to LLM agents.
- Adversarial content can persist across interactions served by the same agent.
- Content remains dormant and is activated by a benign user query.
- Benchmark includes 1,896 instances covering six harmful outcomes.
- Study published on arXiv with identifier 2605.28201.
- Threat is harder to detect than single-interaction attacks.
- Attack targets external observations like tool-returned data or webpages.
- Adversarial content can be injected into MCP context.
Entities
Institutions
- arXiv