Memory Poisoning Masquerades as Model Failure in AI Agents

ai-technology · 2026-05-25

A recent study published on arXiv (2605.22842) uncovers a significant security vulnerability in multi-agent AI systems, termed the Misattribution Gap. The researchers introduce the concept of Semantic Norm Drift (SND), which describes how policy documents stored in shared vector repositories lose their original source due to a Trust Laundering Chain, subsequently being perceived as reliable context by the system. This misattribution leads to agent behaviors that mimic model misalignment, causing defenders to incorrectly identify the issue. Out of 64 observed failures, attribution systems repeatedly pointed to the model as the source. Notably, four safety classifiers, including one focused on memory poisoning, failed to detect any issues across 510 checkpoints. In 59 of 65 valid instances, agents referenced the injected document as a normative source before following it. The findings challenge the belief that agent misconduct stems solely from model misalignment, highlighting a fundamental weakness in memory-layer security.

Key facts

arXiv paper 2605.22842 identifies the Misattribution Gap in multi-agent AI systems.
Semantic Norm Drift (SND) is formalized as a third path to agent misconduct.
Memory-layer attacks produce behaviors indistinguishable from model failure.
Policy documents lose provenance through a Trust Laundering Chain.
64 documented failures were all misattributed to model failure by attribution systems.
Four safety classifiers produced zero detections across 510 checkpoints.
In 59 of 65 valid cases, agents cited the injected document as normative authority.
The study challenges the assumption that agent misconduct stems from model misalignment.

Memory Poisoning Masquerades as Model Failure in AI Agents

Key facts

Entities

Institutions

Sources