Few-Shot Benign DPO Attack Jailbreaks LLMs

ai-technology · 2026-05-13

A novel attack technique exploits Direct Preference Optimization (DPO) to jailbreak large language models (LLMs) using just 10 innocuous preference pairs, the smallest dataset size OpenAI's fine-tuning service accepts. The researchers show that DPO creates a stronger and harder-to-audit failure mode than supervised fine-tuning (SFT). The method pairs harmless prompts with a typical helpful response labeled as preferred and a refusal labeled as dispreferred, making the data indistinguishable from a legitimate dataset built to reduce over-refusal. Because every example is genuinely harmless, the attack raises serious safety concerns for fine-tuning pipelines that rely on preference-based objectives.
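To make the construction concrete, the sketch below shows how such a dataset could be assembled in Python. The JSONL schema ("input", "preferred_output", "non_preferred_output") follows the format OpenAI documents for preference fine-tuning; the prompts, answers, and file name are illustrative placeholders, not the researchers' actual data.

    import json

    # Illustrative benign prompts; the attack described here uses 10 such pairs.
    BENIGN_PROMPTS = [
        "Summarize the plot of Moby-Dick in two sentences.",
        "What is a good beginner recipe for banana bread?",
        "Explain photosynthesis to a ten-year-old.",
        # ... seven more harmless prompts, for 10 pairs total
    ]

    # Matching helpful completions, one per prompt (placeholders).
    HELPFUL_ANSWERS = [
        "Captain Ahab obsessively hunts the white whale that maimed him...",
        "Mash three ripe bananas, stir in flour, sugar, egg, and butter...",
        "Plants use sunlight to turn water and air into the sugar they live on...",
    ]

    # A generic refusal serves as the dispreferred completion in every pair.
    REFUSAL = "I'm sorry, but I can't help with that request."

    def build_pair(prompt: str, answer: str) -> dict:
        """One preference record: helpful answer preferred, refusal dispreferred."""
        return {
            "input": {"messages": [{"role": "user", "content": prompt}]},
            "preferred_output": [{"role": "assistant", "content": answer}],
            "non_preferred_output": [{"role": "assistant", "content": REFUSAL}],
        }

    with open("benign_dpo_pairs.jsonl", "w") as f:
        for prompt, answer in zip(BENIGN_PROMPTS, HELPFUL_ANSWERS):
            f.write(json.dumps(build_pair(prompt, answer)) + "\n")

Every record in such a file is individually harmless, which is exactly what makes the dataset hard for a moderation review to flag as a jailbreak attempt.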

Key facts

  • Attack uses only 10 harmless preference pairs
  • Minimum dataset size accepted by OpenAI's fine-tuning service
  • DPO introduces stronger and harder-to-audit failure mode than SFT
  • Data is indistinguishable from legitimate user requests
  • Benign prompts with helpful answer as preferred and refusal as dispreferred
  • Prior work showed benign SFT can reduce refusal behavior
  • Deployed fine-tuning pipelines increasingly support preference-based objectives (the DPO loss is sketched after this list)
  • Safety risks of preference-based fine-tuning remain less understood
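For context on what such an objective optimizes, the standard DPO loss from Rafailov et al. (2023) is sketched below in LaTeX; mapping it onto the attack, with x the benign prompt, y_w the helpful answer, and y_l the refusal, is our reading of the setup rather than a formula from the article.

    \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
        -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]

Because the loss rewards moving probability mass away from the dispreferred y_l relative to the frozen reference policy \pi_{\mathrm{ref}}, rather than merely imitating y_w as SFT would, it plausibly suppresses refusal behavior more directly, which is consistent with the stronger, harder-to-audit failure mode the researchers report.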

Entities

Institutions

  • OpenAI
