ARTFEED — Contemporary Art Intelligence

SALO: New AI Jailbreak Detection Method Exploits Latent Refusal Trajectories

ai-technology · 2026-05-07

Researchers have introduced SALO (Sparse Activation Localization Operator), a detector that identifies jailbreak attacks on AI models at inference time. Conventional representation engineering relies on static refusal vectors derived from terminal representations; SALO instead treats refusal as a dynamic, sparse process. Using Causal Tracing, the researchers identified a 'Refusal Trajectory', a consistent upstream signature that persists even when adversarial attacks such as GCG suppress the terminal signal. By reading this upstream signature, SALO recovers defense against forced-decoding attacks, raising detection rates from nearly 0% to over 90% in cases where methods that depend on the terminal state fail. The paper is published on arXiv in the computer science category Cryptography and Security (cs.CR).
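The contrast between a terminal-state detector and an upstream, sparse one can be illustrated with a toy sketch. This is not the paper's implementation: the activation values, the refusal vector, the chosen (layer, unit) sites, and the threshold below are all hypothetical and picked by hand, and the function names are invented for illustration. The sketch only shows why a detector that reads a few upstream sites might still fire when an attack has flattened the final-layer signal.

```python
from typing import List, Tuple

# Toy stand-in for hidden states: activations[layer][unit].
Activations = List[List[float]]


def terminal_score(acts: Activations, refusal_vector: List[float]) -> float:
    """Static baseline: project only the final layer onto a fixed refusal vector."""
    return sum(a * r for a, r in zip(acts[-1], refusal_vector))


def trajectory_score(acts: Activations, sites: List[Tuple[int, int]]) -> float:
    """Sparse upstream readout: mean activation over a few (layer, unit) sites."""
    return sum(acts[layer][unit] for layer, unit in sites) / len(sites)


# Hypothetical values, chosen by hand for illustration only.
REFUSAL_VECTOR = [1.0, 0.0, 0.0, 0.0]
UPSTREAM_SITES = [(0, 2), (1, 1), (1, 3)]  # assumed sparse sites in earlier layers
THRESHOLD = 0.5

# Harmful prompt, no attack: both upstream sites and the terminal layer are active.
harmful_plain = [
    [0.1, 0.0, 0.9, 0.0],
    [0.0, 0.8, 0.1, 0.7],
    [0.9, 0.1, 0.0, 0.0],
]
# Same prompt under a simulated attack: upstream layers unchanged,
# terminal refusal signal suppressed.
harmful_attacked = [row[:] for row in harmful_plain[:-1]] + [[0.05, 0.1, 0.0, 0.0]]
# Benign prompt: little activity at any of the monitored positions.
benign = [
    [0.1, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.1, 0.0],
    [0.0, 0.1, 0.0, 0.0],
]


def flags(acts: Activations) -> Tuple[bool, bool]:
    """Return (terminal detector fires, trajectory detector fires)."""
    return (
        terminal_score(acts, REFUSAL_VECTOR) > THRESHOLD,
        trajectory_score(acts, UPSTREAM_SITES) > THRESHOLD,
    )


if __name__ == "__main__":
    for name, acts in [
        ("harmful_plain", harmful_plain),
        ("harmful_attacked", harmful_attacked),
        ("benign", benign),
    ]:
        print(name, flags(acts))
```

In this toy setup the terminal detector misses the attacked prompt while the upstream readout still flags it, and neither fires on the benign prompt. How SALO actually localizes and scores the relevant activations is described in the paper, not here.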

Key facts

  • SALO is an inference-time jailbreak detector
  • Refusal is treated as a dynamic and sparse process
  • Causal Tracing reveals a persistent upstream Refusal Trajectory
  • Adversarial attacks like GCG can suppress terminal refusal signals
  • SALO improves detection rates from ~0% to >90% in cases where terminal state-dependent methods fail
  • The method recovers defense against forced-decoding attacks
  • The paper is on arXiv under cs.CR
  • Representation Engineering typically uses static refusal vectors

Entities

Institutions

  • arXiv

Sources