Reflector: A Two-Stage Framework for LLM Jailbreak Defense

ai-technology · 2026-05-22

A new framework called Reflector aims to defend large language models (LLMs) against indirect jailbreak attacks. It targets vulnerabilities in the internal generation process of LLMs, which such attacks exploit to circumvent surface-level safety alignment. The system is trained in two stages. First, teacher-guided generation creates high-quality reflection data for supervised fine-tuning (SFT). Second, reinforcement learning (RL) with outcome-driven and reward-validity supervision instills autonomous self-reflection. The authors report Defense Success Rates (DSR) exceeding 90% against complex indirect attacks, with robust generalization across diverse threat scenarios. The paper is available on arXiv (2605.20654).
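To make the DSR metric concrete, here is a minimal sketch of how such a rate could be computed. It assumes DSR is the fraction of attack attempts judged as successfully defended; the summary does not give the paper's exact definition or judging procedure, and the function name and sample judgments below are hypothetical.

```python
from typing import Sequence


def defense_success_rate(defended: Sequence[bool]) -> float:
    """Return the fraction of attack attempts judged as defended.

    Assumption: one boolean per attack attempt, True when the model's
    response was judged safe. How that judgment is made is not covered here.
    """
    if not defended:
        raise ValueError("need at least one attack attempt")
    return sum(defended) / len(defended)


# Hypothetical judgments for five indirect attack prompts (illustrative only,
# not results from the paper).
judgments = [True, True, False, True, True]
dsr = defense_success_rate(judgments)
print(f"DSR = {dsr:.0%}")  # prints "DSR = 80%"
```

Under this reading, a DSR above 90% means fewer than one in ten attack attempts in the evaluation set produced an unsafe response.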

Key facts

Reflector is a two-stage framework for LLM jailbreak defense
First stage uses teacher-guided generation for SFT
Second stage uses RL with outcome-driven and reward-validity supervision
Achieves DSR exceeding 90% against indirect attacks
Generalizes robustly across diverse threat scenarios
Addresses vulnerabilities in the internal generation process of LLMs
Paper available as an arXiv preprint: 2605.20654
