ARTFEED — Contemporary Art Intelligence

Sleeper Attack: Persistent Adversarial Threats to LLM Agents

ai-technology · 2026-05-28

A recent study published on arXiv (2605.28201) presents the concept of 'Sleeper Attack,' a safety concern where harmful content lingers through various interactions with Large Language Model (LLM) agents. In contrast to attacks that manifest harmful behavior in a single interaction, this adversarial content can lie inactive within the agent's state and be activated by a harmless user prompt at a later time. The researchers developed a benchmark consisting of 1,896 examples that illustrate six real-world harmful consequences to assess this risk. This research underscores a new vulnerability in LLM agents, complicating efforts for detection and mitigation.

Key facts

  • Sleeper Attack is a persistent adversarial threat to LLM agents.
  • Adversarial content can persist across interactions served by the same agent.
  • Content remains dormant and is activated by a benign user query.
  • Benchmark includes 1,896 instances covering six harmful outcomes.
  • Study published on arXiv with identifier 2605.28201.
  • Threat is harder to detect than single-interaction attacks.
  • Attack targets external observations like tool-returned data or webpages.
  • Adversarial content can be injected into MCP context.

Entities

Institutions

  • arXiv

Sources