Semantic Hijacking Attack Exploits Smarter AI Agents

ai-technology · 2026-05-20

A new study reveals that multi-agent systems using large language models (LLMs) become less secure as their individual agents grow more capable. Researchers identified "semantic hijacking," an attack where harmful requests are hidden within domain-specific narratives and passed from Worker agents to a Manager agent without syntactic injection. In 42,000 adversarial trials across 12 Manager models and 7 Worker configurations, the mean system-level Attack Success Rate (ASR) rose from 18.4% to 63.9% as Worker capability increased, peaking at 94.4%. Multi-level mediation analysis on 47,807 interactions from two datasets showed this paradox is driven by "linguistic certainty": stronger Workers interpret adversarial narratives as legitimate and convey conclusions more assertively. The study is published on arXiv (2605.17480).

Key facts

Multi-agent systems extend LLMs by decomposing tasks among specialized agents.
Semantic hijacking conceals harmful requests in domain-specific narratives.
The attack does not require syntactic injection primitives.
42,000 adversarial trials were conducted across 12 Manager models and 7 Worker configurations.
Mean ASR increased from 18.4% to 63.9% as Worker capability increased.
Peak ASR reached 94.4%.
Multi-level mediation analysis was performed on 47,807 interactions from two datasets.
Stronger Workers exhibit higher linguistic certainty, interpreting adversarial narratives as legitimate.
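
As a rough illustration of the Attack Success Rate metric cited above, the sketch below computes ASR as the fraction of adversarial trials that succeed, grouped by Worker configuration. The record layout, the function name `attack_success_rate`, and the toy trial data are assumptions made for illustration only; they are not taken from the study and do not reproduce its results.

```python
from collections import defaultdict


def attack_success_rate(trials):
    """Return ASR per Worker configuration as a fraction in [0, 1].

    `trials` is an iterable of (worker_name, succeeded) pairs.
    This layout is hypothetical, not the study's data format.
    """
    counts = defaultdict(lambda: [0, 0])  # worker -> [successes, total]
    for worker, succeeded in trials:
        counts[worker][0] += int(succeeded)
        counts[worker][1] += 1
    return {worker: s / n for worker, (s, n) in counts.items()}


# Made-up trial outcomes, for illustration only.
trials = [
    ("weak_worker", False), ("weak_worker", False),
    ("weak_worker", True), ("weak_worker", False),
    ("strong_worker", True), ("strong_worker", True),
    ("strong_worker", False), ("strong_worker", True),
]

asr = attack_success_rate(trials)
for worker, rate in asr.items():
    print(f"{worker}: ASR = {rate:.1%}")
```

In this toy data the stronger Worker shows the higher ASR, which mirrors the direction of the reported trend but says nothing about its size.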
