ThinkSafe: Self-Generated Safety Alignment for Reasoning Models
ThinkSafe is a self-generated safety-alignment framework for large reasoning models (LRMs) that restores safety without relying on external teacher models. LRMs, trained with reinforcement learning on reasoning tasks, produce long chain-of-thought (CoT) traces but tend to over-optimize for compliance, leaving them vulnerable to harmful prompts. Existing methods that distill from an external teacher introduce a distributional gap that degrades the model's native reasoning. The authors formalize safety realignment as a KL projection onto a safe simplex and show that the student's own safety-filtered distribution is the unique optimal target, while any external teacher incurs an irreducible excess KL penalty. ThinkSafe exploits the observation that models retain latent knowledge to recognize harm even when compliance training suppresses safe behavior. The paper is available on arXiv (ID 2601.23143).
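To make the KL-projection claim concrete, here is a plausible reconstruction consistent with the summary's description; the notation is illustrative and may differ from the paper's own symbols. Let $\pi(\cdot \mid x)$ be the student's response distribution for prompt $x$ and $\mathcal{Y}_{\text{safe}}(x)$ the set of safe responses. The safe simplex is the set of distributions supported on $\mathcal{Y}_{\text{safe}}(x)$, and safety realignment is the I-projection

$$\pi^{*}(\cdot \mid x) = \arg\min_{q:\ \mathrm{supp}(q) \subseteq \mathcal{Y}_{\text{safe}}(x)} \mathrm{KL}\big(q \,\|\, \pi(\cdot \mid x)\big),$$

whose unique solution is the student's own safety-filtered (truncated and renormalized) distribution,

$$\pi^{*}(y \mid x) = \frac{\pi(y \mid x)}{\sum_{y' \in \mathcal{Y}_{\text{safe}}(x)} \pi(y' \mid x)}, \qquad y \in \mathcal{Y}_{\text{safe}}(x).$$

For any external teacher $\tau$ supported on $\mathcal{Y}_{\text{safe}}(x)$, the standard Pythagorean identity for I-projections gives

$$\mathrm{KL}(\tau \,\|\, \pi) = \mathrm{KL}(\tau \,\|\, \pi^{*}) + \mathrm{KL}(\pi^{*} \,\|\, \pi),$$

so distilling toward any teacher $\tau \neq \pi^{*}$ pays an irreducible excess penalty of $\mathrm{KL}(\tau \,\|\, \pi^{*})$ over the optimum.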
Key facts
- ThinkSafe is a self-generated alignment framework for LRMs (a code sketch follows this list)
- It restores safety without external teachers
- LRMs use RL on reasoning tasks to generate long CoT reasoning
- Over-optimization for compliance makes models vulnerable to harmful prompts
- External teacher distillation introduces a distributional gap that degrades native reasoning
- Safety realignment is formalized as KL projection onto the safe simplex
- Student's own safety-filtered distribution is the unique KL-optimal target
- External teacher incurs irreducible excess KL penalty
- Models retain latent knowledge to identify harm despite compliance suppression
- Paper available on arXiv: 2601.23143
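The facts above describe the mechanism only at a high level. Below is a minimal, hypothetical sketch of what self-generated safety alignment could look like in practice: sample responses from the student itself, filter them with the student's own harm judgment (its latent knowledge), and keep only the safe completions as fine-tuning targets. It assumes Hugging Face transformers; `MODEL_NAME`, `JUDGE_TEMPLATE`, and all helper functions are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a self-generated safety-alignment loop.
# Names and prompts are illustrative, not the paper's actual code.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "some-reasoning-model"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

JUDGE_TEMPLATE = (
    "Is the following response to the request harmful? "
    "Answer yes or no.\nRequest: {prompt}\nResponse: {response}\nAnswer:"
)

def sample_responses(prompt: str, n: int = 8) -> list[str]:
    """Sample n candidate responses from the student's own distribution."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.8,
        max_new_tokens=512,
        num_return_sequences=n,
        pad_token_id=tokenizer.eos_token_id,
    )
    prompt_len = inputs["input_ids"].shape[1]
    return [
        tokenizer.decode(o[prompt_len:], skip_special_tokens=True)
        for o in outputs
    ]

def self_judge_safe(prompt: str, response: str) -> bool:
    """Use the model's own latent harm knowledge as the safety filter."""
    judge_prompt = JUDGE_TEMPLATE.format(prompt=prompt, response=response)
    inputs = tokenizer(judge_prompt, return_tensors="pt")
    out = model.generate(
        **inputs,
        max_new_tokens=3,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
    verdict = tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return verdict.strip().lower().startswith("no")  # "no" = not harmful

def build_realignment_set(harmful_prompts: list[str]) -> list[dict]:
    """Keep only the model's own safe completions: empirical samples
    from the safety-filtered distribution described above."""
    dataset = []
    for prompt in harmful_prompts:
        for response in sample_responses(prompt):
            if self_judge_safe(prompt, response):
                dataset.append({"prompt": prompt, "response": response})
    return dataset
```

The resulting prompt-response pairs would then serve as supervised fine-tuning data, pulling the policy toward its own safety-filtered target rather than toward an external teacher's distribution.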
Entities
Institutions
- arXiv