Semantic Hijacking Attack Exploits Smarter AI Agents
A new study reveals that multi-agent systems using large language models (LLMs) become less secure as their individual agents grow more capable. Researchers identified "semantic hijacking," an attack where harmful requests are hidden within domain-specific narratives and passed from Worker agents to a Manager agent without syntactic injection. In 42,000 adversarial trials across 12 Manager models and 7 Worker configurations, the mean system-level Attack Success Rate (ASR) rose from 18.4% to 63.9% as Worker capability increased, peaking at 94.4%. Multi-level mediation analysis on 47,807 interactions from two datasets showed this paradox is driven by "linguistic certainty": stronger Workers interpret adversarial narratives as legitimate and convey conclusions more assertively. The study is published on arXiv (2605.17480).
Key facts
- Multi-agent systems extend LLMs by decomposing tasks among specialized agents.
- Semantic hijacking conceals harmful requests in domain-specific narratives.
- The attack does not require syntactic injection primitives.
- 42,000 adversarial trials conducted over 12 Manager models and 7 Worker configurations.
- Mean ASR increased from 18.4% to 63.9% as Worker capability increased.
- Peak ASR reached 94.4%.
- Multi-level mediation analysis performed on 47,807 interactions from two datasets.
- Stronger Workers exhibit higher linguistic certainty, interpreting adversarial narratives as legitimate.