ARTFEED — Contemporary Art Intelligence

RouteHijack: New Attack Targets MoE LLM Safety

ai-technology · 2026-05-07

Researchers have developed RouteHijack, a novel adversarial attack targeting Mixture-of-Experts (MoE) large language models. Unlike prior methods that rely on prompt engineering or internal model access, RouteHijack exploits the routing mechanism unique to MoE architectures: it optimizes input tokens to influence which experts are activated, steering the model toward unsafe outputs. The key insight is that safety-related behavior is concentrated in a small subset of experts, which makes MoE models vulnerable to routing manipulation. RouteHijack first analyzes model responses to identify the safety-critical experts, then crafts inputs that route around them. This addresses the limitations of existing jailbreaks, which are heuristic, require privileged access, or fail because routing decisions are non-differentiable. The paper, published on arXiv (2605.02946), highlights a fundamental security challenge as MoE models become more prevalent.
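
The routing mechanism at issue is easiest to see in code. Below is a minimal, self-contained sketch of top-k gating in a toy MoE layer, showing how a small change to a token representation can change which experts fire. This is a generic illustration of the architecture, not the authors' code; the expert count, top-k value, hidden size, and random weights are all assumed for the example.

    import torch

    torch.manual_seed(0)

    NUM_EXPERTS = 8   # assumed number of experts in the toy layer
    TOP_K = 2         # assumed number of experts activated per token
    HIDDEN = 16       # assumed size of a token representation

    # Gating network: one score per expert for each token representation.
    # Real MoE LLMs use a learned router of this general shape.
    gate = torch.nn.Linear(HIDDEN, NUM_EXPERTS)

    def route(token_repr):
        """Return the indices of the top-k experts selected for one token."""
        logits = gate(token_repr)                  # one logit per expert
        return torch.topk(logits, TOP_K).indices   # hard, non-differentiable choice

    token = torch.randn(HIDDEN)
    print("experts for the original token:", route(token).tolist())

    # A modest perturbation of the token representation can flip the routing
    # decision; routing-aware attacks exploit exactly this sensitivity.
    perturbed = token + 0.5 * torch.randn(HIDDEN)
    print("experts after perturbation:    ", route(perturbed).tolist())

Because the top-k selection is a hard choice, gradients do not flow through it directly, which is the non-differentiability the article refers to.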

Key facts

  • RouteHijack is a routing-aware jailbreak for MoE LLMs
  • It exploits the concentration of safety behavior in a subset of experts
  • The attack uses input optimization to influence routing decisions (a toy version is sketched after this list)
  • It overcomes limitations of prompt-based and model-intervention methods
  • The paper is published on arXiv as 2605.02946
  • MoE architectures are increasingly adopted to scale model capacity
  • Safety alignment is critical for responsible LLM deployment
  • Existing attacks are heuristic, require privileged access, or fail because routing is non-differentiable
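
As a rough illustration of what input optimization over non-differentiable routing can mean, the sketch below greedily substitutes tokens in a toy prompt so that none of them are routed to a designated "safety" expert. The router, vocabulary, expert index, and search loop are all hypothetical; this is only a cartoon of routing-aware token substitution, not the paper's method.

    import torch

    torch.manual_seed(0)

    VOCAB, HIDDEN, NUM_EXPERTS, TOP_K = 50, 16, 8, 2   # assumed toy sizes
    SAFETY_EXPERT = 3                                  # hypothetical "safety" expert index

    embed = torch.nn.Embedding(VOCAB, HIDDEN)          # toy token embeddings
    gate = torch.nn.Linear(HIDDEN, NUM_EXPERTS)        # toy router

    def hits_safety_expert(token_id):
        """True if this token is routed to the hypothetical safety expert."""
        logits = gate(embed(torch.tensor(token_id)))
        return SAFETY_EXPERT in torch.topk(logits, TOP_K).indices.tolist()

    def reroute(prompt):
        """Greedily swap tokens so none of them activate the safety expert."""
        out = list(prompt)
        for i, tok in enumerate(out):
            if not hits_safety_expert(tok):
                continue
            for candidate in range(VOCAB):             # query-based, no gradients needed
                if not hits_safety_expert(candidate):
                    out[i] = candidate
                    break
        return out

    prompt = [7, 21, 3, 44]
    print("routed to safety expert (before):", [hits_safety_expert(t) for t in prompt])
    print("routed to safety expert (after): ",
          [hits_safety_expert(t) for t in reroute(prompt)])

A real attack must also preserve the meaning of the prompt while shifting its routing, which is where the optimization described in the article comes in.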

Entities

Institutions

  • arXiv

Sources