Pre-Model Safeguard Using Draft Models for LLM Jailbreak Defense
A new safeguard design leverages jailbreak attack transferability from large language models (LLMs) to small language models (SLMs) to enforce prompt safety before target model inference. The approach aims to reduce false-negative rates common in pre-model guards and avoid the high computational cost of post-model guards. The study systematically examines jailbreak transferability, identifying key factors that influence it. The paper is published on arXiv under identifier 2605.19321.
Key facts
- arXiv paper 2605.19321 introduces a pre-model safeguard using draft models.
- The method exploits jailbreak attack transferability from LLMs to SLMs.
- It aims to reduce false-negative rates of pre-model guards.
- It avoids high token usage and processing time of post-model guards.
- A systematic study of jailbreak transferability is conducted.
- Key factors influencing transferability are identified.
- The approach enforces prompt safety before target model inference.
- The paper is classified as a cross-type announcement on arXiv.
Entities
Institutions
- arXiv