Safety Internal: Training AI to Self-Verify Response Safety
A new framework called Safety Internal (SInternal) proposes training large reasoning models (LRMs) exclusively on safety verification tasks so that they internalize safety specifications. The approach addresses a limitation of current alignment methods, which only enforce external compliance and leave models vulnerable to adversarial jailbreaks. Guided by expert reasoning trajectories, models learn to critique their own generated answers and verify their safety, which induces strong generalization. The research, published on arXiv, demonstrates that learning to verify significantly enhances robustness against malicious prompts.
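To make the verification-training idea concrete, here is a minimal sketch (not the authors' released code) of how a single safety-verification training instance might be assembled: the model is shown a request together with its own generated answer, and the supervised target is an expert reasoning trajectory ending in a safety verdict. The field names, prompt template, and verdict labels are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class VerificationExample:
    prompt: str             # original (possibly adversarial) user request
    model_answer: str       # response sampled from the model being trained
    expert_trajectory: str  # step-by-step safety reasoning written by an expert
    verdict: str            # hypothetical label, e.g. "safe" or "unsafe"


def to_sft_pair(ex: VerificationExample) -> dict:
    """Format one example as an (input, target) pair for supervised fine-tuning.

    The input asks the model to critique its own answer; the target is the
    expert trajectory followed by the final verdict, so the model learns to
    reproduce the verification reasoning rather than only the label.
    """
    source = (
        "You are reviewing your own response for safety.\n"
        f"User request:\n{ex.prompt}\n\n"
        f"Your response:\n{ex.model_answer}\n\n"
        "Think step by step about whether the response violates the safety "
        "specification, then give a final verdict."
    )
    target = f"{ex.expert_trajectory}\nVerdict: {ex.verdict}"
    return {"input": source, "target": target}
```

Under this reading, the model never trains directly on producing answers to harmful prompts; it trains only on judging whether a given answer is safe, and the paper's claim is that this verification skill generalizes back to safer generation.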
Key facts
- arXiv:2605.08930v1
- Safety Internal (SInternal) framework proposed
- Trains LRMs on safety verification tasks
- Uses expert reasoning trajectories for self-critique
- Addresses vulnerability to adversarial jailbreaks
- Current alignment methods rely on external compliance
- Empirical analysis shows lack of intrinsic safety understanding
- Learning to verify induces strong generalization for response safety
Entities
Institutions
- arXiv