Self-Mined Hard Prompts Cut Jailbreak Success but Spike Refusal Rates
Researchers have introduced a safety fine-tuning method that scores candidate prompts by how often the target model's own rollouts are judged harmful, then fine-tunes on the hardest prompts paired with non-jailbroken rollouts. Applied to Llama-3-8B-Instruct and Llama-3.2-3B-Instruct, it cuts the WildJailbreak attack success rate (ASR) from 11.5% and 20.1% respectively to 1-3%, but raises the refusal rate on benign prompts that resemble jailbreaks from 14-22% to 74-94%. Interleaving hard prompts 1:1 with adversarially-framed benign prompts brings refusal down to 30-51% for the 8B model and 52-72% for the 3B model, at a cost of 2-6 percentage points of ASR. Within this mixed regime, training only on the hardest half of the eligible prompt pool cuts the remaining ASR by a further 35-50%, roughly 3 percentage points. The paper is available on arXiv (2605.03226).
Key facts
- Method scores prompts by how often target model's own rollouts are judged harmful.
- Fine-tunes on hardest prompts paired with non-jailbroken rollouts.
- Tested on Llama-3-8B-Instruct and Llama-3.2-3B-Instruct.
- WildJailbreak ASR reduced from 11.5% and 20.1% to 1-3%.
- Refusal on jailbreak-shaped benign prompts increased from 14-22% to 74-94%.
- Interleaving hard prompts 1:1 with adversarially-framed benign prompts cuts refusal to 30-51% (8B) and 52-72% (3B).
- Mixed regime costs 2-6 percentage points of ASR.
- Training on hardest half of eligible pool cuts remaining ASR by 35-50%.
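The selection pipeline above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `judge_harmful` is a hypothetical stand-in for a real harmfulness judge, the eligibility rule and placeholder benign target are assumptions, and names like `build_training_mix` are invented for this sketch.

```python
import random

def judge_harmful(rollout: str) -> bool:
    # Hypothetical stub: real use would call a safety judge model
    # on each rollout from the target model.
    return "HARMFUL" in rollout

def score_prompt(rollouts: list[str]) -> float:
    """Difficulty = fraction of the model's own rollouts judged harmful."""
    return sum(judge_harmful(r) for r in rollouts) / len(rollouts)

def build_training_mix(prompt_rollouts: dict[str, list[str]],
                       benign_prompts: list[str],
                       keep_fraction: float = 0.5,
                       seed: int = 0) -> list[dict]:
    """Keep the hardest `keep_fraction` of eligible prompts, pair each with
    one of its non-jailbroken rollouts, and interleave 1:1 with
    adversarially-framed benign prompts."""
    scored = []
    for prompt, rollouts in prompt_rollouts.items():
        s = score_prompt(rollouts)
        safe = [r for r in rollouts if not judge_harmful(r)]
        # Assumed eligibility rule: the prompt is sometimes jailbroken
        # (hard) but still has at least one safe rollout to train on.
        if 0.0 < s < 1.0 and safe:
            scored.append((s, prompt, safe[0]))
    scored.sort(reverse=True)  # hardest (highest harmful frequency) first
    hardest = scored[: max(1, int(len(scored) * keep_fraction))]

    rng = random.Random(seed)
    mix = []
    for _, prompt, safe_rollout in hardest:
        mix.append({"prompt": prompt, "target": safe_rollout})
        # 1:1 interleave with a benign prompt; its target would be a
        # normal compliant response (placeholder string here).
        mix.append({"prompt": rng.choice(benign_prompts),
                    "target": "(compliant response)"})
    return mix
```

With the hardest half kept, each hard (prompt, safe rollout) pair is immediately followed by one benign example, which is what the 1:1 regime in the results refers to.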
Entities
Institutions
- arXiv