RouteGuard Detects Skill Poisoning in LLM Agents via Attention Hijacking
A new arXiv preprint (2604.22888) introduces RouteGuard, a detection method for skill poisoning in LLM agents. Unlike traditional indirect prompt injection, skill poisoning hides malicious instructions inside legitimate action-oriented skills. The authors identify attention hijacking as the underlying mechanism, whereby response-time attention shifts away from trusted context and toward malicious skill spans. RouteGuard is a frozen-backbone detector that combines response-conditioned attention with hidden-state alignment through reliability-gated late fusion. Evaluated on real and synthetic open-source skill benchmarks, it achieves 0.8834 F1 on the critical Skill-Inject channel slice and recovers 90.51% of description accuracy, consistently matching or outperforming the strongest existing detectors.
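The fusion idea described above can be illustrated with a minimal sketch. The preprint's actual gating and scoring are not specified in this summary, so the function below is a hypothetical interpretation: two detector scores (one from attention, one from hidden-state alignment) are combined with weights proportional to per-input reliability gates, so an unreliable signal is down-weighted rather than discarded.

```python
import numpy as np

def reliability_gated_fusion(attn_score, hidden_score, attn_rel, hidden_rel):
    """Fuse two detector scores via reliability-gated late fusion (sketch).

    All inputs are floats in [0, 1]. The gates (attn_rel, hidden_rel)
    down-weight a signal when it is judged unreliable for this input.
    Hypothetical illustration; not RouteGuard's exact formulation.
    """
    gates = np.array([attn_rel, hidden_rel], dtype=float)
    weights = gates / gates.sum()  # normalize gates into fusion weights
    scores = np.array([attn_score, hidden_score], dtype=float)
    return float(weights @ scores)  # convex combination of the two scores

# If the attention signal is unreliable (gate 0.2) but hidden-state
# alignment is trusted (gate 0.8), fusion leans on the latter:
fused = reliability_gated_fusion(0.9, 0.3, 0.2, 0.8)  # -> 0.42
```

With equal gates this reduces to plain score averaging; the gating only matters when one signal's reliability drops for a given input.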
Key facts
- RouteGuard detects skill poisoning in LLM agents
- Skill poisoning is a new form of indirect prompt injection
- Attackers hide malicious instructions in action-oriented skills
- Attention hijacking is the internal effect exploited by poisoning
- RouteGuard uses response-conditioned attention and hidden-state alignment
- It employs reliability-gated late fusion
- Evaluated on real and synthetic open-source skill benchmarks
- Achieves 0.8834 F1 on the Skill-Inject channel slice
- Recovers 90.51% of description accuracy
- Published on arXiv with ID 2604.22888
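The attention-hijacking effect listed above can be made concrete with a toy diagnostic, assuming access to the model's response-time attention weights over context tokens. The statistic below, the share of attention mass landing on skill-span tokens versus trusted-context tokens, is an illustrative stand-in, not the paper's actual signal.

```python
import numpy as np

def attention_hijack_ratio(attn, skill_idx, trusted_idx):
    """Share of response-time attention mass on skill-span tokens
    relative to trusted-context tokens.

    attn: (num_response_tokens, num_context_tokens) attention weights,
          e.g. averaged over heads and layers. Index lists mark which
          context positions belong to the skill text vs. trusted context.
    Illustrative diagnostic only; not RouteGuard's exact statistic.
    """
    skill_mass = attn[:, skill_idx].sum()
    trusted_mass = attn[:, trusted_idx].sum()
    return float(skill_mass / (skill_mass + trusted_mass))

# Toy example: 2 response tokens over 4 context positions, where
# positions 2-3 are a (hypothetical) poisoned skill description.
attn = np.array([[0.1, 0.1, 0.4, 0.4],
                 [0.2, 0.1, 0.3, 0.4]])
ratio = attention_hijack_ratio(attn, [2, 3], [0, 1])  # -> 0.75
```

A ratio near 1.0 would indicate that generation is attending almost entirely to the skill span, the hijacking pattern the paper associates with poisoned skills.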