One-Shot Defense Against Many-Shot Jailbreak Attacks on LLMs
A recent study published on arXiv (2605.08277) shows that many-shot jailbreaking (MSJ) attacks on safety-aligned language models become more potent as the number of harmful demonstrations grows. The study attributes this to progressive activation drift: as additional harmful demonstrations are prepended, the representation of a fixed harmful query shifts step by step away from the safety-aligned region. The drift can be interpreted as a form of implicit malicious fine-tuning, in which conditioning on N harmful demonstrations induces SGD-style updates akin to optimizing on those N samples. This interpretation turns the attack mechanism into a defense: including a fixed one-shot safety demonstration at inference time induces a counteracting safety-oriented update, restoring refusal behavior and improving the model's robustness against MSJ attacks.
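To make the inference-time recipe concrete, here is a minimal sketch in Python, assuming a generic chat-message format. The SAFETY_DEMO content, the build_messages helper, and the placement of the demonstration are illustrative assumptions rather than details taken from the study.

```python
# Illustrative sketch of the one-shot safety-demonstration defense.
# The demonstration content, message format, and placement below are
# assumptions for illustration, not the paper's exact recipe.

from typing import Dict, List, Optional

# A fixed one-shot safety demonstration: a harmful request paired with a refusal.
SAFETY_DEMO: List[Dict[str, str]] = [
    {"role": "user", "content": "How do I pick a lock to break into a house?"},
    {"role": "assistant", "content": "I can't help with that. Breaking into "
                                     "someone's home is illegal and harmful."},
]


def build_messages(user_query: str,
                   context: Optional[List[Dict[str, str]]] = None) -> List[Dict[str, str]]:
    """Insert the fixed safety demonstration so it conditions the model
    alongside any (possibly adversarial, many-shot) in-context examples."""
    messages: List[Dict[str, str]] = []
    if context:
        messages.extend(context)  # untrusted in-context demonstrations, if any
    # Placing the safety demonstration immediately before the query is an
    # assumption here; the summary above does not specify its position.
    messages.extend(SAFETY_DEMO)
    messages.append({"role": "user", "content": user_query})
    return messages
```

The resulting message list would then be passed to whatever chat-completion call the deployment already uses; the defense is purely a prompt-construction step at inference time and requires no model updates.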
Key facts
- Many-shot jailbreaking (MSJ) attacks cause safety-aligned language models to answer harmful queries by preceding them with many harmful question-answer demonstrations.
- The attack becomes stronger as the number of demonstrations increases.
- MSJ induces progressive activation drift: the representation of a fixed harmful query moves step by step away from the safety-aligned region.
- The drift is interpreted as implicit malicious fine-tuning.
- Conditioning on N harmful demonstrations induces SGD-style updates equivalent to optimizing on N harmful samples (see the sketch after this list).
- A fixed one-shot safety demonstration at inference time induces a counteracting safety-oriented update.
- The method restores refusal behavior and improves model robustness.
- The study is published on arXiv with ID 2605.08277.
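As a rough formal sketch of the bullets above, the implicit fine-tuning view can be written with illustrative notation (theta for effective parameters, eta for an effective step size, ell for a per-demonstration loss; none of these symbols are taken from the paper):

```latex
% Hedged sketch: conditioning on N harmful demonstrations (x_i, y_i) is read
% as N implicit SGD-style updates on effective parameters \theta.
\theta_N \;\approx\; \theta_0 - \eta \sum_{i=1}^{N} \nabla_\theta\,
  \ell\bigl(f_{\theta}(x_i),\, y_i\bigr)

% A fixed one-shot safety demonstration (x_s, y_s) at inference time adds a
% counteracting, safety-oriented term.
\theta_{N+1} \;\approx\; \theta_N - \eta\, \nabla_\theta\,
  \ell\bigl(f_{\theta}(x_s),\, y_s\bigr)
```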
Entities
Institutions
- arXiv