Sleeper Attack: Persistent Adversarial Threats to LLM Agents

ai-technology · 2026-05-28

A recent study published on arXiv (2605.28201) presents the concept of 'Sleeper Attack,' a safety concern where harmful content lingers through various interactions with Large Language Model (LLM) agents. In contrast to attacks that manifest harmful behavior in a single interaction, this adversarial content can lie inactive within the agent's state and be activated by a harmless user prompt at a later time. The researchers developed a benchmark consisting of 1,896 examples that illustrate six real-world harmful consequences to assess this risk. This research underscores a new vulnerability in LLM agents, complicating efforts for detection and mitigation.

Key facts

Sleeper Attack is a persistent adversarial threat to LLM agents.
Adversarial content can persist across interactions served by the same agent.
Content remains dormant and is activated by a benign user query.
Benchmark includes 1,896 instances covering six harmful outcomes.
Study published on arXiv with identifier 2605.28201.
Threat is harder to detect than single-interaction attacks.
Attack targets external observations like tool-returned data or webpages.
Adversarial content can be injected into MCP context.

Sleeper Attack: Persistent Adversarial Threats to LLM Agents

Key facts

Entities

Institutions

Sources