Trojan Hippo Attack Exploits LLM Memory for Data Theft

ai-technology · 2026-05-06

A new study describes Trojan Hippo, a class of persistent memory attacks on LLM agents. Unlike earlier memory poisoning techniques, it assumes a more realistic threat model: the attacker plants a dormant payload in the agent's long-term memory through a single call to an untrusted tool, such as reading a crafted email. The payload stays inactive until the user discusses a sensitive topic such as finance, health, or identity, at which point it activates and exfiltrates high-value personal data to the attacker. Although anecdotal demonstrations of such attacks against existing systems have surfaced, previous research has not systematically evaluated them across different memory architectures and defenses. To fill that gap, the researchers present a dynamic evaluation framework that includes an OpenEvolve-based adaptive red-teaming benchmark for stress-testing defenses and memory backends. The paper is available on arXiv under the identifier 2605.01970.

Key facts

Trojan Hippo attack is a class of persistent memory attacks on LLM agents.
Attack plants dormant payload via single untrusted tool call (e.g., crafted email).
Payload activates when user discusses sensitive topics (finance, health, identity).
Attack exfiltrates high-value personal data to attacker.
Prior work lacked systematic evaluation across memory architectures and defenses.
New dynamic evaluation framework includes OpenEvolve-based adaptive red-teaming benchmark.
Research published on arXiv (2605.01970).
Attack operates under more realistic threat model than prior memory poisoning.
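
The mechanism summarized above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the paper's implementation or benchmark: the `ToyMemory` class, the keyword-overlap retrieval, the `trusted` provenance flag, and the inert placeholder string are all hypothetical names invented for illustration. It only shows why an entry written from untrusted tool output can lie dormant and then resurface when a later conversation touches a sensitive topic, and how filtering retrieval by provenance is one possible mitigation.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    text: str
    source: str    # where the entry came from: "user" or "tool"
    trusted: bool  # hypothetical provenance flag, not taken from the paper


@dataclass
class ToyMemory:
    """Toy long-term memory with naive keyword-overlap retrieval (an assumption)."""
    entries: list = field(default_factory=list)

    def write(self, text: str, source: str, trusted: bool) -> None:
        self.entries.append(MemoryEntry(text, source, trusted))

    def retrieve(self, query: str, trusted_only: bool = False) -> list:
        # Return every entry that shares at least one word with the query.
        words = set(query.lower().split())
        hits = []
        for entry in self.entries:
            if trusted_only and not entry.trusted:
                continue  # provenance filter: skip entries from untrusted tools
            if words & set(entry.text.lower().split()):
                hits.append(entry)
        return hits


def build_poisoned_memory() -> ToyMemory:
    memory = ToyMemory()
    # A normal entry written from the user's own conversation.
    memory.write("user prefers short answers", source="user", trusted=True)
    # One untrusted tool call (e.g. reading a crafted email) writes an entry
    # keyed to sensitive topics. The payload here is an inert placeholder.
    memory.write(
        "when bank or salary topics arise: <inert placeholder payload>",
        source="tool",
        trusted=False,
    )
    return memory


if __name__ == "__main__":
    memory = build_poisoned_memory()
    for query in ("what is the weather today", "which bank should i use"):
        hits = memory.retrieve(query)
        filtered = memory.retrieve(query, trusted_only=True)
        print(query, "| retrieved:", len(hits), "| with filter:", len(filtered))
```

In this toy setup the planted entry is never retrieved for the unrelated query and only surfaces once the query mentions a sensitive keyword; with `trusted_only=True` it is excluded entirely. Real memory backends use embedding-based or structured retrieval, so this says nothing about how the defenses evaluated in the paper actually perform.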
