One-Shot Defense Against Many-Shot Jailbreak Attacks on LLMs
A recent study published on arXiv (2605.08277) shows that many-shot jailbreaking (MSJ) attacks on safety-aligned language models become more potent as the number of harmful demonstrations grows. The study attributes this to progressive activation drift: as additional harmful demonstrations are prepended, the representation of a fixed harmful query shifts step by step away from the safety-aligned region. The drift can be interpreted as a form of implicit malicious fine-tuning, in which conditioning on N harmful demonstrations induces SGD-style updates akin to optimizing on those N samples. This interpretation turns the attack mechanism into a defense: including a fixed one-shot safety demonstration at inference time induces a counteracting safety-oriented update, restoring refusal behavior and improving the model's robustness against MSJ attacks.
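To make the inference-time recipe concrete, here is a minimal sketch in Python, assuming a generic chat-message format. The SAFETY_DEMO content, the build_messages helper, and the placement of the demonstration are illustrative assumptions rather than details taken from the study.

```python
# Illustrative sketch of the one-shot safety-demonstration defense.
# The demonstration content, message format, and placement below are
# assumptions for illustration, not the paper's exact recipe.

from typing import Dict, List, Optional

# A fixed one-shot safety demonstration: a harmful request paired with a refusal.
SAFETY_DEMO: List[Dict[str, str]] = [
    {"role": "user", "content": "How do I pick a lock to break into a house?"},
    {"role": "assistant", "content": "I can't help with that. Breaking into "
                                     "someone's home is illegal and harmful."},
]


def build_messages(user_query: str,
                   context: Optional[List[Dict[str, str]]] = None) -> List[Dict[str, str]]:
    """Insert the fixed safety demonstration so it conditions the model
    alongside any (possibly adversarial, many-shot) in-context examples."""
    messages: List[Dict[str, str]] = []
    if context:
        messages.extend(context)  # untrusted in-context demonstrations, if any
    # Placing the safety demonstration immediately before the query is an
    # assumption here; the summary above does not specify its position.
    messages.extend(SAFETY_DEMO)
    messages.append({"role": "user", "content": user_query})
    return messages
```

The resulting message list would then be passed to whatever chat-completion call the deployment already uses; the defense is purely a prompt-construction step at inference time and requires no model updates.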
Key facts
- Many-shot jailbreaking (MSJ) attacks cause safety-aligned language models to answer harmful queries by preceding them with many harmful question-answer demonstrations.
- The attack becomes stronger as the number of demonstrations increases.
- MSJ induces progressive activation drift: the representation of a fixed harmful query moves step by step away from the safety-aligned region.
- The drift is interpreted as implicit malicious fine-tuning.
- Conditioning on N harmful demonstrations induces SGD-style updates equivalent to optimizing on N harmful samples (see the sketch after this list).
- A fixed one-shot safety demonstration at inference time induces a counteracting safety-oriented update.
- The method restores refusal behavior and improves model robustness.
- The study is published on arXiv with ID 2605.08277.
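As a rough formal sketch of the bullets above, the implicit fine-tuning view can be written with illustrative notation (theta for effective parameters, eta for an effective step size, ell for a per-demonstration loss; none of these symbols are taken from the paper):

```latex
% Hedged sketch: conditioning on N harmful demonstrations (x_i, y_i) is read
% as N implicit SGD-style updates on effective parameters \theta.
\theta_N \;\approx\; \theta_0 - \eta \sum_{i=1}^{N} \nabla_\theta\,
  \ell\bigl(f_{\theta}(x_i),\, y_i\bigr)

% A fixed one-shot safety demonstration (x_s, y_s) at inference time adds a
% counteracting, safety-oriented term.
\theta_{N+1} \;\approx\; \theta_N - \eta\, \nabla_\theta\,
  \ell\bigl(f_{\theta}(x_s),\, y_s\bigr)
```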
Entities
Institutions
- arXiv