SALO: New AI Jailbreak Detection Method Exploits Latent Refusal Trajectories

ai-technology · 2026-05-07

Researchers have introduced SALO (Sparse Activation Localization Operator), an inference-time detector for AI jailbreak attacks. Conventional representation engineering relies on fixed refusal vectors derived from terminal representations; SALO instead treats refusal as a dynamic, sparse process. Using Causal Tracing, the researchers identified a 'Refusal Trajectory', a consistent upstream signature that persists even when adversarial attacks such as GCG suppress terminal signals. SALO captures these upstream patterns, restoring defense against forced-decoding attacks and raising detection rates from nearly 0% to over 90% in settings where methods that depend on the terminal state fail. The paper is posted on arXiv under cs.CR (Cryptography and Security).

Key facts

SALO is an inference-time jailbreak detector
Refusal is treated as a dynamic and sparse process
Causal Tracing reveals a persistent upstream Refusal Trajectory
Adversarial attacks like GCG can suppress terminal refusal signals
SALO improves detection rates from ~0% to >90% in settings where terminal-state-dependent methods fail
The method recovers defense against forced-decoding attacks
The paper is on arXiv under cs.CR
Representation Engineering typically uses static refusal vectors
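
The contrast between a static, terminal-only refusal vector and an upstream signal can be illustrated with a toy sketch. This is not SALO's implementation: the difference-of-means direction, the top-k aggregation in trajectory_score, the threshold, and all activation values are assumptions invented for illustration. It only shows why a detector that reads the final layer alone can miss an attack that suppresses the terminal signal while earlier layers still carry it.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean_vector(rows: List[Vector]) -> List[float]:
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(refused: List[Vector], complied: List[Vector]) -> List[float]:
    """Difference-of-means direction, normalised to unit length (assumed recipe)."""
    diff = [r - c for r, c in zip(mean_vector(refused), mean_vector(complied))]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]


def terminal_score(layers: List[Vector], direction: Vector) -> float:
    """Static baseline: project only the final-layer representation."""
    return dot(layers[-1], direction)


def trajectory_score(layers: List[Vector], direction: Vector, top_k: int = 2) -> float:
    """Hypothetical upstream score: mean of the top_k largest per-layer projections."""
    scores = sorted((dot(h, direction) for h in layers), reverse=True)
    return sum(scores[:top_k]) / top_k


# Made-up final-layer activations used to fit the direction.
refused_final = [[2.0, 0.1], [1.8, -0.1]]
complied_final = [[0.0, 0.1], [0.2, -0.1]]
direction = refusal_direction(refused_final, complied_final)

# Made-up per-layer activations, earliest layer first.
# The attacked prompt keeps a refusal signal mid-network but loses it at the end.
attacked = [[0.1, 0.0], [1.5, 0.2], [1.7, -0.1], [0.05, 0.0]]
benign = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [0.1, 0.0]]

THRESHOLD = 1.0  # illustrative cut-off, not a tuned value

for name, layers in (("attacked", attacked), ("benign", benign)):
    flagged_terminal = terminal_score(layers, direction) > THRESHOLD
    flagged_trajectory = trajectory_score(layers, direction) > THRESHOLD
    print(name, "terminal:", flagged_terminal, "trajectory:", flagged_trajectory)
```

In this toy setup the terminal-only score stays below the threshold for the attacked prompt, while the score that looks across layers exceeds it and the benign prompt is left unflagged by both.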
