Few-Shot Benign DPO Attack Jailbreaks LLMs

ai-technology · 2026-05-13

A novel attack technique exploits Direct Preference Optimization (DPO) to jailbreak large language models (LLMs) using just 10 innocuous preference pairs, the smallest dataset size OpenAI's fine-tuning service accepts. The researchers show that DPO creates a stronger and harder-to-audit failure mode than supervised fine-tuning (SFT). The method pairs harmless prompts with a typical helpful response labeled as preferred and a refusal labeled as dispreferred, making the data indistinguishable from a legitimate dataset built to reduce over-refusal. Because every example is genuinely harmless, the attack raises serious safety concerns for fine-tuning pipelines that rely on preference-based objectives.
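To make the construction concrete, the sketch below shows how such a dataset could be assembled in Python. The JSONL schema ("input", "preferred_output", "non_preferred_output") follows the format OpenAI documents for preference fine-tuning; the prompts, answers, and file name are illustrative placeholders, not the researchers' actual data.

    import json

    # Illustrative benign prompts; the attack described here uses 10 such pairs.
    BENIGN_PROMPTS = [
        "Summarize the plot of Moby-Dick in two sentences.",
        "What is a good beginner recipe for banana bread?",
        "Explain photosynthesis to a ten-year-old.",
        # ... seven more harmless prompts, for 10 pairs total
    ]

    # Matching helpful completions, one per prompt (placeholders).
    HELPFUL_ANSWERS = [
        "Captain Ahab obsessively hunts the white whale that maimed him...",
        "Mash three ripe bananas, stir in flour, sugar, egg, and butter...",
        "Plants use sunlight to turn water and air into the sugar they live on...",
    ]

    # A generic refusal serves as the dispreferred completion in every pair.
    REFUSAL = "I'm sorry, but I can't help with that request."

    def build_pair(prompt: str, answer: str) -> dict:
        """One preference record: helpful answer preferred, refusal dispreferred."""
        return {
            "input": {"messages": [{"role": "user", "content": prompt}]},
            "preferred_output": [{"role": "assistant", "content": answer}],
            "non_preferred_output": [{"role": "assistant", "content": REFUSAL}],
        }

    with open("benign_dpo_pairs.jsonl", "w") as f:
        for prompt, answer in zip(BENIGN_PROMPTS, HELPFUL_ANSWERS):
            f.write(json.dumps(build_pair(prompt, answer)) + "\n")

Every record in such a file is individually harmless, which is exactly what makes the dataset hard for a moderation review to flag as a jailbreak attempt.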

Key facts

  • Attack uses only 10 harmless preference pairs
  • Minimum dataset size accepted by OpenAI's fine-tuning service
  • DPO introduces stronger and harder-to-audit failure mode than SFT
  • Data is indistinguishable from legitimate user requests
  • Benign prompts with helpful answer as preferred and refusal as dispreferred
  • Prior work showed benign SFT can reduce refusal behavior
  • Deployed fine-tuning pipelines increasingly support preference-based objectives (the DPO loss is sketched after this list)
  • Safety risks of preference-based fine-tuning remain less understood
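For context on what such an objective optimizes, the standard DPO loss from Rafailov et al. (2023) is sketched below in LaTeX; mapping it onto the attack, with x the benign prompt, y_w the helpful answer, and y_l the refusal, is our reading of the setup rather than a formula from the article.

    \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
        -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]

Because the loss rewards moving probability mass away from the dispreferred y_l relative to the frozen reference policy \pi_{\mathrm{ref}}, rather than merely imitating y_w as SFT would, it plausibly suppresses refusal behavior more directly, which is consistent with the stronger, harder-to-audit failure mode the researchers report.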

Entities

Institutions

  • OpenAI
