RouteGuard Detects Skill Poisoning in LLM Agents via Attention Hijacking
A new arXiv preprint (2604.22888) introduces RouteGuard, a detection method for skill poisoning in LLM agents. Unlike traditional indirect prompt injection, skill poisoning hides malicious instructions inside legitimate action-oriented skills. The authors identify attention hijacking as the underlying mechanism, whereby response-time attention shifts away from trusted context and toward malicious skill spans. RouteGuard is a frozen-backbone detector that combines response-conditioned attention with hidden-state alignment through reliability-gated late fusion. Evaluated on real and synthetic open-source skill benchmarks, it achieves 0.8834 F1 on the critical Skill-Inject channel slice and recovers 90.51% of description accuracy, consistently matching or outperforming the strongest existing detectors.
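The fusion idea described above can be illustrated with a minimal sketch. The preprint's actual gating and scoring are not specified in this summary, so the function below is a hypothetical interpretation: two detector scores (one from attention, one from hidden-state alignment) are combined with weights proportional to per-input reliability gates, so an unreliable signal is down-weighted rather than discarded.

```python
import numpy as np

def reliability_gated_fusion(attn_score, hidden_score, attn_rel, hidden_rel):
    """Fuse two detector scores via reliability-gated late fusion (sketch).

    All inputs are floats in [0, 1]. The gates (attn_rel, hidden_rel)
    down-weight a signal when it is judged unreliable for this input.
    Hypothetical illustration; not RouteGuard's exact formulation.
    """
    gates = np.array([attn_rel, hidden_rel], dtype=float)
    weights = gates / gates.sum()  # normalize gates into fusion weights
    scores = np.array([attn_score, hidden_score], dtype=float)
    return float(weights @ scores)  # convex combination of the two scores

# If the attention signal is unreliable (gate 0.2) but hidden-state
# alignment is trusted (gate 0.8), fusion leans on the latter:
fused = reliability_gated_fusion(0.9, 0.3, 0.2, 0.8)  # -> 0.42
```

With equal gates this reduces to plain score averaging; the gating only matters when one signal's reliability drops for a given input.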
Key facts
- RouteGuard detects skill poisoning in LLM agents
- Skill poisoning is a new form of indirect prompt injection
- Attackers hide malicious instructions in action-oriented skills
- Attention hijacking is the internal effect exploited by poisoning
- RouteGuard uses response-conditioned attention and hidden-state alignment
- It employs reliability-gated late fusion
- Evaluated on real and synthetic open-source skill benchmarks
- Achieves 0.8834 F1 on the Skill-Inject channel slice
- Recovers 90.51% of description accuracy
- Published on arXiv with ID 2604.22888
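The attention-hijacking effect listed above can be made concrete with a toy diagnostic, assuming access to the model's response-time attention weights over context tokens. The statistic below, the share of attention mass landing on skill-span tokens versus trusted-context tokens, is an illustrative stand-in, not the paper's actual signal.

```python
import numpy as np

def attention_hijack_ratio(attn, skill_idx, trusted_idx):
    """Share of response-time attention mass on skill-span tokens
    relative to trusted-context tokens.

    attn: (num_response_tokens, num_context_tokens) attention weights,
          e.g. averaged over heads and layers. Index lists mark which
          context positions belong to the skill text vs. trusted context.
    Illustrative diagnostic only; not RouteGuard's exact statistic.
    """
    skill_mass = attn[:, skill_idx].sum()
    trusted_mass = attn[:, trusted_idx].sum()
    return float(skill_mass / (skill_mass + trusted_mass))

# Toy example: 2 response tokens over 4 context positions, where
# positions 2-3 are a (hypothetical) poisoned skill description.
attn = np.array([[0.1, 0.1, 0.4, 0.4],
                 [0.2, 0.1, 0.3, 0.4]])
ratio = attention_hijack_ratio(attn, [2, 3], [0, 1])  # -> 0.75
```

A ratio near 1.0 would indicate that generation is attending almost entirely to the skill span, the hijacking pattern the paper associates with poisoned skills.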