ARTFEED — Contemporary Art Intelligence

RouteGuard Detects Skill Poisoning in LLM Agents via Attention Hijacking

ai-technology · 2026-04-29

A new arXiv preprint (2604.22888) introduces RouteGuard, a method for detecting skill poisoning in LLM agents. Unlike traditional indirect prompt injection, skill poisoning hides malicious instructions inside otherwise legitimate, action-oriented skills. The authors identify attention hijacking as the underlying mechanism: at response time, the model's attention shifts from trusted context to the malicious skill spans. RouteGuard is a frozen-backbone detector that combines response-conditioned attention with hidden-state alignment through reliability-gated late fusion. Evaluated on real and synthetic open-source skill benchmarks, it achieves an F1 of 0.8834 on the critical Skill-Inject channel slice and recovers 90.51% of description accuracy, consistently matching or outperforming the strongest baseline detectors.
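Reliability-gated late fusion, in general, means combining the scores of two detectors while weighting each by how trustworthy its signal is on the current input. The preprint's exact formulation is not reproduced here, so the following is a minimal illustrative sketch under assumed names and a simple convex-combination gate:

```python
def reliability_gated_fusion(attn_score: float, attn_conf: float,
                             hidden_score: float, hidden_conf: float) -> float:
    """Fuse two per-sample poisoning scores by reliability (illustrative only).

    attn_score / hidden_score: scores in [0, 1] from the attention-based and
                               hidden-state-alignment detectors (assumed names).
    attn_conf / hidden_conf:   reliability gates in [0, 1], e.g. obtained by
                               calibrating each detector on held-out data
                               (an assumption, not the paper's procedure).
    """
    total = attn_conf + hidden_conf
    if total == 0.0:
        # Neither signal is reliable: fall back to an uninformative score.
        return 0.5
    return (attn_conf * attn_score + hidden_conf * hidden_score) / total
```

With this shape, a detector whose gate drops to zero is ignored entirely, while equally reliable detectors are simply averaged; the actual gating function in the paper may be learned rather than fixed.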

Key facts

  • RouteGuard detects skill poisoning in LLM agents
  • Skill poisoning is a new form of indirect prompt injection
  • Attackers hide malicious instructions in action-oriented skills
  • Attention hijacking is the internal effect exploited by poisoning
  • RouteGuard uses response-conditioned attention and hidden-state alignment
  • It employs reliability-gated late fusion
  • Evaluated on real and synthetic open-source skill benchmarks
  • Achieves 0.8834 F1 on Skill-Inject channel slice
  • Recovers 90.51% of description accuracy
  • Published on arXiv with ID 2604.22888

Entities

Institutions

  • arXiv
