ARTFEED — Contemporary Art Intelligence

Reflector: A Two-Stage Framework for LLM Jailbreak Defense

ai-technology · 2026-05-22

A new framework called Reflector aims to defend large language models (LLMs) against indirect jailbreak attacks. Training proceeds in two stages: first, teacher-guided generation produces high-quality reflection data for supervised fine-tuning (SFT); then reinforcement learning (RL) with outcome-driven and reward-validity supervision instills autonomous self-reflection. The authors report Defense Success Rates (DSR) exceeding 90% against complex indirect attacks, with robust generalization across diverse threat scenarios. The framework targets attacks that circumvent surface-level safety alignment by exploiting the model's internal generation process. The paper is available as a preprint on arXiv (2605.20654).
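The summary does not define how the Defense Success Rate is computed. As a rough illustration only, the sketch below assumes DSR is the fraction of attack attempts for which the model's response was judged safe. The `AttackResult` record, the `defense_success_rate` function, and the sample data are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class AttackResult:
    # Hypothetical record for one attack attempt (not from the paper).
    attack_type: str  # e.g. "indirect"
    defended: bool    # True if the model's response was judged safe


def defense_success_rate(results):
    """Fraction of attack attempts that were defended, in [0, 1].

    Assumes DSR = defended attempts / total attempts.
    """
    if not results:
        raise ValueError("no results to score")
    return sum(r.defended for r in results) / len(results)


# Illustrative data only; these are not results from the paper.
results = [
    AttackResult("indirect", True),
    AttackResult("indirect", True),
    AttackResult("indirect", False),
    AttackResult("indirect", True),
]

dsr = defense_success_rate(results)
print(f"DSR: {dsr:.0%}")
```

In practice, the `defended` flag would come from whatever safety judge an evaluation uses; the paper's own judging procedure is not described in this summary.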

Key facts

  • Reflector is a two-stage framework for LLM jailbreak defense
  • First stage uses teacher-guided generation for SFT
  • Second stage uses RL with outcome-driven and reward-validity supervision
  • Achieves DSR exceeding 90% against indirect attacks
  • Generalizes robustly across diverse threat scenarios
  • Addresses vulnerabilities in the LLM's internal generation process
  • Paper available as an arXiv preprint: 2605.20654

Entities

Institutions

  • arXiv
