ARTFEED — Contemporary Art Intelligence

LLM Activation Analysis Detects Multi-Turn Prompt Injection Attacks

ai-technology · 2026-05-01

A new detection technique identifies multi-turn prompt injection attacks on large language models by examining activation patterns in the residual stream. Researchers found that the canonical attack path of trust-building, pivoting, and escalation produces a measurable 'adversarial restlessness': activation path lengths far longer than those of benign conversations. Using five scalar trajectory features, conversation-level detection improved from 76.2% to 93.8% on synthetic held-out data. The signal replicated across four model families (24B-70B), although probes were model-specific and did not transfer between architectures. Generalization depended on source distribution: leave-one-source-out evaluation showed that synthetic, LMSYS-Chat-1M, and SafeDialBench data capture distinct attack distributions, and detection on real-world LMSYS conversations reached 47-71% when that distribution was represented in training. The findings are documented in arXiv paper 2604.28129.
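The core quantity behind 'adversarial restlessness' is the total distance an activation vector travels across conversation turns. A minimal sketch of that idea, assuming one residual-stream activation vector per turn (the vectors below are hypothetical placeholders, not real model activations):

```python
import math

def path_length(activations):
    """Total Euclidean distance traveled by a per-turn activation
    trajectory; a proxy for 'adversarial restlessness'."""
    return sum(math.dist(prev, cur)
               for prev, cur in zip(activations, activations[1:]))

# Hypothetical trajectories: a benign chat drifts slowly, while the
# trust-build -> pivot -> escalate pattern makes large jumps.
benign = [[0.0, 0.0], [0.1, 0.0], [0.1, 0.1]]
attack = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.5, 1.5]]

print(path_length(attack) > path_length(benign))  # → True
```

In the paper's setting the vectors would be high-dimensional residual-stream states; the attack trajectory's longer path is what the detector thresholds on.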

Key facts

  • Multi-turn prompt injection follows a known attack path: trust-building, pivoting, escalation.
  • Text-level defenses miss covert attacks where individual turns appear benign.
  • The attack path leaves an activation-level signature in the model's residual stream.
  • Each phase shift displaces the activation, producing a total path length far exceeding that of benign conversations.
  • This phenomenon is called 'adversarial restlessness'.
  • Five scalar trajectory features improved detection from 76.2% to 93.8% on synthetic held-out data.
  • The signal replicates across four model families (24B-70B).
  • Probes are model-specific and do not transfer across architectures.
  • Generalization is source-dependent: leave-one-source-out evaluation shows synthetic, LMSYS-Chat-1M, and SafeDialBench capture distinct attack distributions.
  • Detection on real-world LMSYS reaches 47-71% when its distribution is represented in training.
  • The paper is published on arXiv with ID 2604.28129.
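The summary does not list the five scalar trajectory features the detector uses, so the sketch below picks five plausible candidates (total path length, net displacement, largest step, mean step, straightness) purely for illustration; the paper's actual feature set may differ:

```python
import math

def trajectory_features(traj):
    """Five illustrative scalar features of a per-turn activation
    trajectory (assumed features, not the paper's exact set)."""
    steps = [math.dist(a, b) for a, b in zip(traj, traj[1:])]
    total = sum(steps)                   # total path length
    net = math.dist(traj[0], traj[-1])   # net start-to-end displacement
    return [
        total,
        net,
        max(steps),                      # largest single phase shift
        total / len(steps),              # mean step size
        net / total if total else 1.0,   # straightness: 1.0 = direct path
    ]

feats = trajectory_features([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
print(len(feats))  # → 5
```

Scalar summaries like these can feed any lightweight conversation-level classifier; per the results above, such probes work within a model family but do not transfer across architectures.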

Entities

Institutions

  • arXiv

Sources