LLM Activation Analysis Detects Multi-Turn Prompt Injection Attacks
A detection technique identifies multi-turn prompt injection attacks on large language models by examining activation patterns in the residual stream. The researchers found that the phases of the attack path (trust-building, pivoting, and escalation) produce a measurable 'adversarial restlessness': each phase shift moves the activation, so the total path length is far longer than in benign conversations. Using five scalar trajectory features, conversation-level detection improved from 76.2% to 93.8% on synthetic held-out data. The signal replicated across four model families (24B-70B), although probes were model-specific and did not transfer between architectures. Generalization depended on the source distribution: leave-one-source-out evaluation indicated that synthetic data, LMSYS-Chat-1M, and SafeDialBench each capture a distinct attack distribution, and detection on real-world LMSYS conversations reached 47-71% when that distribution was represented in training. The findings are reported in arXiv paper 2604.28129.
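The path-length idea can be illustrated with a minimal sketch. It assumes one activation vector per conversation turn and Euclidean distance between consecutive turns; this summary does not say how the paper pools activations, which layer it reads, or what the five trajectory features are, so the function name `path_length` and the toy 2-D vectors below are hypothetical and only show the general shape of the computation.

```python
import math

def path_length(trajectory):
    """Total distance travelled by a per-turn activation trajectory.

    `trajectory` is a sequence of equal-length vectors, one per turn.
    Assumption: Euclidean distance between consecutive turns.
    """
    return sum(math.dist(a, b) for a, b in zip(trajectory, trajectory[1:]))

# Toy 2-D stand-ins for residual-stream activations (illustrative values only).
benign = [(0.0, 0.0), (0.1, 0.0), (0.1, 0.1), (0.2, 0.1)]   # small, steady drift
attack = [(0.0, 0.0), (3.0, 4.0), (0.0, 0.0), (3.0, 4.0)]   # large jumps at each phase shift

print(path_length(benign))
print(path_length(attack))
```

In this toy case the trajectory with repeated large jumps accumulates a much longer path than the steady one, which is the kind of contrast the 'adversarial restlessness' signal describes.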
Key facts
- Multi-turn prompt injection follows a known attack path: trust-building, pivoting, escalation.
- Text-level defenses miss covert attacks where individual turns appear benign.
- The attack path leaves an activation-level signature in the model's residual stream.
- Each phase shift moves the activation, producing a total path length far exceeding that of benign conversations.
- This phenomenon is called 'adversarial restlessness'.
- Five scalar trajectory features improved detection from 76.2% to 93.8% on synthetic held-out data.
- The signal replicates across four model families (24B-70B).
- Probes are model-specific and do not transfer across architectures.
- Generalization is source-dependent: leave-one-source-out evaluation shows synthetic, LMSYS-Chat-1M, and SafeDialBench capture distinct attack distributions.
- Detection on real-world LMSYS reaches 47-71% when its distribution is represented in training.
- The paper is published on arXiv with ID 2604.28129.
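The leave-one-source-out evaluation mentioned above can be sketched as a simple split: hold out every conversation from one source, train on the rest, and repeat for each source. The record fields (`id`, `source`, `is_attack`), the function name, and the toy records are assumptions for illustration, not the paper's data or code.

```python
def leave_one_source_out(conversations):
    """Yield (held_out_source, train, test), holding out one source per fold."""
    sources = sorted({c["source"] for c in conversations})
    for held_out in sources:
        train = [c for c in conversations if c["source"] != held_out]
        test = [c for c in conversations if c["source"] == held_out]
        yield held_out, train, test

# Toy records with made-up ids and labels (hypothetical, not real data).
toy = [
    {"id": 1, "source": "synthetic", "is_attack": True},
    {"id": 2, "source": "synthetic", "is_attack": False},
    {"id": 3, "source": "lmsys", "is_attack": True},
    {"id": 4, "source": "safedialbench", "is_attack": True},
]

for held_out, train, test in leave_one_source_out(toy):
    # A probe would be fit on `train` and scored on `test` here.
    print(held_out, len(train), len(test))
```

Because the held-out source never appears in training, a drop in detection on that fold suggests the source captures an attack distribution the others do not cover.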
Entities
Institutions
- arXiv