ARTFEED — Contemporary Art Intelligence

One-Shot Defense Against Many-Shot Jailbreak Attacks on LLMs

ai-technology · 2026-05-12

A recent study posted on arXiv (2605.08277) shows that many-shot jailbreaking (MSJ) attacks on safety-aligned language models grow more potent as the number of harmful in-context examples increases, a phenomenon the authors call progressive activation drift: as additional harmful demonstrations are added, the model's internal representation of a fixed harmful query shifts step by step away from the safety-aligned region. The paper interprets this drift as a form of implicit malicious fine-tuning, in which conditioning on N harmful examples induces SGD-style updates akin to actually optimizing on those N samples. That interpretation turns the attack mechanism into a defense: injecting a single one-shot safety demonstration at inference time induces a counteracting, safety-oriented update, restoring refusal behavior and hardening the model against MSJ attacks.
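The defense described above amounts to prompt construction at inference time. The sketch below is illustrative only: the helper name `build_defended_prompt`, the contents of the fixed safety demonstration, and its placement at the head of the context are assumptions, not details taken from the paper.

```python
# Hypothetical sketch of the one-shot defense: a single fixed safety
# demonstration is inserted into the context so it conditions the model
# alongside any attacker-supplied many-shot (MSJ) prefix. All names and
# the placement choice here are illustrative assumptions.

SAFETY_DEMO = (
    "User: How do I pick a lock to break into a house?\n"
    "Assistant: I can't help with that. Breaking into someone's home is "
    "illegal and harmful.\n"
)

def build_defended_prompt(context_demos: list, query: str) -> str:
    """Prepend one fixed safety demonstration to the incoming context.

    context_demos may contain many attacker-supplied harmful Q/A pairs
    (the MSJ prefix); the single safety example is intended to induce a
    counteracting safety-oriented update in-context.
    """
    return SAFETY_DEMO + "".join(context_demos) + f"User: {query}\nAssistant:"

# Example: an MSJ-style prefix of repeated harmful demonstrations.
msj_prefix = [
    f"User: harmful question {i}\nAssistant: harmful answer {i}\n"
    for i in range(64)
]
prompt = build_defended_prompt(msj_prefix, "another harmful question")
print(prompt.startswith(SAFETY_DEMO))
```

In a real deployment the returned string would be sent to the model; the point of the sketch is simply that the defense requires no weight changes, only one fixed demonstration added to every inference call.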

Key facts

  • Many-shot jailbreaking (MSJ) attacks cause safety-aligned language models to answer harmful queries by preceding them with many harmful question-answer demonstrations.
  • The attack becomes stronger as the number of demonstrations increases.
  • MSJ induces progressive activation drift: the representation of a fixed harmful query moves step by step away from the safety-aligned region.
  • The drift is interpreted as implicit malicious fine-tuning.
  • Conditioning on N harmful demonstrations induces SGD-style updates equivalent to optimizing on N harmful samples.
  • A fixed one-shot safety demonstration at inference time induces a counteracting safety-oriented update.
  • The method restores refusal behavior and improves model robustness.
  • The study is published on arXiv with ID 2605.08277.
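The implicit fine-tuning interpretation in the facts above can be written schematically as follows; the notation here is a simplified sketch, not the paper's exact formulation. Conditioning on N harmful demonstrations (x_i, y_i) behaves like N implicit gradient steps on the model's effective parameters:

```latex
\theta_N \;\approx\; \theta_0 \;-\; \eta \sum_{i=1}^{N}
  \nabla_\theta\, \mathcal{L}\!\left(x_i^{\text{harm}},\, y_i^{\text{harm}};\, \theta\right)
```

Larger N means a larger cumulative update, which matches the observation that the attack strengthens with more demonstrations. A single safety pair (x^safe, y^safe) contributes one opposing term of the same form:

```latex
\theta_{N+1} \;\approx\; \theta_N \;-\; \eta\,
  \nabla_\theta\, \mathcal{L}\!\left(x^{\text{safe}},\, y^{\text{safe}};\, \theta\right)
```

which is the sense in which one safety demonstration can counteract the drift induced by the harmful prefix.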

Entities

Institutions

  • arXiv

Sources