Reflector: A Two-Stage Framework for LLM Jailbreak Defense
A new framework called Reflector aims to defend large language models (LLMs) against indirect jailbreak attacks, which circumvent surface-level safety alignment by exploiting the model's internal generation process. Training proceeds in two stages: first, teacher-guided generation produces high-quality reflection data for supervised fine-tuning (SFT); then reinforcement learning (RL) with outcome-driven and reward-validity supervision instills autonomous self-reflection. The reported results show Defense Success Rates (DSR) exceeding 90% against complex indirect attacks, with robust generalization across diverse threat scenarios. The paper is available on arXiv (2605.20654).
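To make the headline metric concrete, here is a minimal sketch of how a Defense Success Rate could be computed. It assumes DSR is simply the percentage of attack attempts the model successfully defends; the summary does not give the paper's exact definition or judging procedure. The names `AttackResult` and `defense_success_rate` and the toy data are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class AttackResult:
    """Outcome of one jailbreak attempt (hypothetical record layout)."""
    prompt_id: str
    defended: bool  # True if the model's response was judged safe


def defense_success_rate(results):
    """Return the percentage of attack attempts that were defended.

    Assumption: DSR = defended attempts / total attempts * 100.
    """
    if not results:
        raise ValueError("need at least one attack result")
    defended = sum(1 for r in results if r.defended)
    return 100.0 * defended / len(results)


# Toy data for illustration only; these are not results from the paper.
toy_results = [
    AttackResult("attack-1", True),
    AttackResult("attack-2", True),
    AttackResult("attack-3", False),
    AttackResult("attack-4", True),
]

toy_dsr = defense_success_rate(toy_results)
print(f"DSR on toy data: {toy_dsr:.1f}%")
```

In practice the `defended` flag would come from a safety judge (human or automated) applied to each response, and that choice strongly affects the reported number.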
Key facts
- Reflector is a two-stage framework for LLM jailbreak defense
- First stage uses teacher-guided generation for SFT
- Second stage uses RL with outcome-driven and reward-validity supervision
- Achieves DSR exceeding 90% against indirect attacks
- Generalizes robustly across diverse threat scenarios
- Targets attacks that exploit the LLM's internal generation process to bypass surface-level safety alignment
- Paper available as an arXiv preprint: 2605.20654
Entities
Institutions
- arXiv