ARTFEED — Contemporary Art Intelligence

RouteHijack: New Attack Targets MoE LLM Safety

ai-technology · 2026-05-07

Researchers have developed RouteHijack, a novel adversarial attack targeting Mixture-of-Experts (MoE) large language models. Unlike prior methods that rely on prompt engineering or internal model access, RouteHijack exploits the routing mechanism unique to MoE architectures: it optimizes input tokens to influence which experts are activated, steering the model toward unsafe outputs. The key insight is that safety-related behavior is concentrated in a small subset of experts, which makes MoE models vulnerable to routing manipulation. RouteHijack first analyzes model responses to identify the safety-critical experts, then crafts inputs that route around them. This addresses the limitations of existing jailbreaks, which are heuristic, require privileged access, or fail because routing decisions are non-differentiable. The paper, published on arXiv (2605.02946), highlights a fundamental security challenge as MoE models become more prevalent.
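
The routing mechanism at issue is easiest to see in code. Below is a minimal, self-contained sketch of top-k gating in a toy MoE layer, showing how a small change to a token representation can change which experts fire. This is a generic illustration of the architecture, not the authors' code; the expert count, top-k value, hidden size, and random weights are all assumed for the example.

    import torch

    torch.manual_seed(0)

    NUM_EXPERTS = 8   # assumed number of experts in the toy layer
    TOP_K = 2         # assumed number of experts activated per token
    HIDDEN = 16       # assumed size of a token representation

    # Gating network: one score per expert for each token representation.
    # Real MoE LLMs use a learned router of this general shape.
    gate = torch.nn.Linear(HIDDEN, NUM_EXPERTS)

    def route(token_repr):
        """Return the indices of the top-k experts selected for one token."""
        logits = gate(token_repr)                  # one logit per expert
        return torch.topk(logits, TOP_K).indices   # hard, non-differentiable choice

    token = torch.randn(HIDDEN)
    print("experts for the original token:", route(token).tolist())

    # A modest perturbation of the token representation can flip the routing
    # decision; routing-aware attacks exploit exactly this sensitivity.
    perturbed = token + 0.5 * torch.randn(HIDDEN)
    print("experts after perturbation:    ", route(perturbed).tolist())

Because the top-k selection is a hard choice, gradients do not flow through it directly, which is the non-differentiability the article refers to.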

Key facts

  • RouteHijack is a routing-aware jailbreak for MoE LLMs
  • It exploits the concentration of safety behavior in a subset of experts
  • The attack uses input optimization to influence routing decisions (a toy version is sketched after this list)
  • It overcomes limitations of prompt-based and model-intervention methods
  • The paper is published on arXiv as 2605.02946
  • MoE architectures are increasingly adopted to scale model capacity
  • Safety alignment is critical for responsible LLM deployment
  • Existing attacks are heuristic, require privileged access, or fail because routing is non-differentiable
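
As a rough illustration of what input optimization over non-differentiable routing can mean, the sketch below greedily substitutes tokens in a toy prompt so that none of them are routed to a designated "safety" expert. The router, vocabulary, expert index, and search loop are all hypothetical; this is only a cartoon of routing-aware token substitution, not the paper's method.

    import torch

    torch.manual_seed(0)

    VOCAB, HIDDEN, NUM_EXPERTS, TOP_K = 50, 16, 8, 2   # assumed toy sizes
    SAFETY_EXPERT = 3                                  # hypothetical "safety" expert index

    embed = torch.nn.Embedding(VOCAB, HIDDEN)          # toy token embeddings
    gate = torch.nn.Linear(HIDDEN, NUM_EXPERTS)        # toy router

    def hits_safety_expert(token_id):
        """True if this token is routed to the hypothetical safety expert."""
        logits = gate(embed(torch.tensor(token_id)))
        return SAFETY_EXPERT in torch.topk(logits, TOP_K).indices.tolist()

    def reroute(prompt):
        """Greedily swap tokens so none of them activate the safety expert."""
        out = list(prompt)
        for i, tok in enumerate(out):
            if not hits_safety_expert(tok):
                continue
            for candidate in range(VOCAB):             # query-based, no gradients needed
                if not hits_safety_expert(candidate):
                    out[i] = candidate
                    break
        return out

    prompt = [7, 21, 3, 44]
    print("routed to safety expert (before):", [hits_safety_expert(t) for t in prompt])
    print("routed to safety expert (after): ",
          [hits_safety_expert(t) for t in reroute(prompt)])

A real attack must also preserve the meaning of the prompt while shifting its routing, which is where the optimization described in the article comes in.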

Entities

Institutions

  • arXiv

Sources