ThinkSafe: Self-Generated Safety Alignment for Reasoning Models
ThinkSafe is a self-generated safety-alignment framework for large reasoning models (LRMs) that restores safety without relying on external teacher models. LRMs, trained with reinforcement learning on reasoning tasks, produce long chain-of-thought (CoT) traces but tend to over-optimize for compliance, leaving them vulnerable to harmful prompts. Existing methods that distill from an external teacher introduce a distributional gap that degrades the model's native reasoning. The authors formalize safety realignment as a KL projection onto a safe simplex and show that the student's own safety-filtered distribution is the unique optimal target, while any external teacher incurs an irreducible excess KL penalty. ThinkSafe exploits the observation that models retain latent knowledge to recognize harm even when compliance training suppresses safe behavior. The paper is available on arXiv (ID 2601.23143).
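To make the KL-projection claim concrete, here is a plausible reconstruction consistent with the summary's description; the notation is illustrative and may differ from the paper's own symbols. Let $\pi(\cdot \mid x)$ be the student's response distribution for prompt $x$ and $\mathcal{Y}_{\text{safe}}(x)$ the set of safe responses. The safe simplex is the set of distributions supported on $\mathcal{Y}_{\text{safe}}(x)$, and safety realignment is the I-projection

$$\pi^{*}(\cdot \mid x) = \arg\min_{q:\ \mathrm{supp}(q) \subseteq \mathcal{Y}_{\text{safe}}(x)} \mathrm{KL}\big(q \,\|\, \pi(\cdot \mid x)\big),$$

whose unique solution is the student's own safety-filtered (truncated and renormalized) distribution,

$$\pi^{*}(y \mid x) = \frac{\pi(y \mid x)}{\sum_{y' \in \mathcal{Y}_{\text{safe}}(x)} \pi(y' \mid x)}, \qquad y \in \mathcal{Y}_{\text{safe}}(x).$$

For any external teacher $\tau$ supported on $\mathcal{Y}_{\text{safe}}(x)$, the standard Pythagorean identity for I-projections gives

$$\mathrm{KL}(\tau \,\|\, \pi) = \mathrm{KL}(\tau \,\|\, \pi^{*}) + \mathrm{KL}(\pi^{*} \,\|\, \pi),$$

so distilling toward any teacher $\tau \neq \pi^{*}$ pays an irreducible excess penalty of $\mathrm{KL}(\tau \,\|\, \pi^{*})$ over the optimum.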
Key facts
- ThinkSafe is a self-generated alignment framework for LRMs (a code sketch follows this list)
- It restores safety without external teachers
- LRMs use RL on reasoning tasks to generate long CoT reasoning
- Over-optimization for compliance makes models vulnerable to harmful prompts
- External teacher distillation introduces a distributional gap that degrades native reasoning
- Safety realignment is formalized as KL projection onto the safe simplex
- Student's own safety-filtered distribution is the unique KL-optimal target
- External teacher incurs irreducible excess KL penalty
- Models retain latent knowledge to identify harm despite compliance suppression
- Paper available on arXiv: 2601.23143
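The facts above describe the mechanism only at a high level. Below is a minimal, hypothetical sketch of what self-generated safety alignment could look like in practice: sample responses from the student itself, filter them with the student's own harm judgment (its latent knowledge), and keep only the safe completions as fine-tuning targets. It assumes Hugging Face transformers; `MODEL_NAME`, `JUDGE_TEMPLATE`, and all helper functions are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a self-generated safety-alignment loop.
# Names and prompts are illustrative, not the paper's actual code.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "some-reasoning-model"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

JUDGE_TEMPLATE = (
    "Is the following response to the request harmful? "
    "Answer yes or no.\nRequest: {prompt}\nResponse: {response}\nAnswer:"
)

def sample_responses(prompt: str, n: int = 8) -> list[str]:
    """Sample n candidate responses from the student's own distribution."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.8,
        max_new_tokens=512,
        num_return_sequences=n,
        pad_token_id=tokenizer.eos_token_id,
    )
    prompt_len = inputs["input_ids"].shape[1]
    return [
        tokenizer.decode(o[prompt_len:], skip_special_tokens=True)
        for o in outputs
    ]

def self_judge_safe(prompt: str, response: str) -> bool:
    """Use the model's own latent harm knowledge as the safety filter."""
    judge_prompt = JUDGE_TEMPLATE.format(prompt=prompt, response=response)
    inputs = tokenizer(judge_prompt, return_tensors="pt")
    out = model.generate(
        **inputs,
        max_new_tokens=3,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
    verdict = tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return verdict.strip().lower().startswith("no")  # "no" = not harmful

def build_realignment_set(harmful_prompts: list[str]) -> list[dict]:
    """Keep only the model's own safe completions: empirical samples
    from the safety-filtered distribution described above."""
    dataset = []
    for prompt in harmful_prompts:
        for response in sample_responses(prompt):
            if self_judge_safe(prompt, response):
                dataset.append({"prompt": prompt, "response": response})
    return dataset
```

The resulting prompt-response pairs would then serve as supervised fine-tuning data, pulling the policy toward its own safety-filtered target rather than toward an external teacher's distribution.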
Entities
Institutions
- arXiv