Safety Internal: Training AI to Self-Verify Response Safety
A new framework called Safety Internal (SInternal) proposes training large reasoning models (LRMs) exclusively on safety verification tasks so that they internalize safety specifications. The approach addresses a limitation of current alignment methods, which only enforce external compliance and leave models vulnerable to adversarial jailbreaks. Guided by expert reasoning trajectories, models learn to critique their own generated answers and verify their safety, which induces strong generalization. The research, published on arXiv, demonstrates that learning to verify significantly enhances robustness against malicious prompts.
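To make the verification-training idea concrete, here is a minimal sketch (not the authors' released code) of how a single safety-verification training instance might be assembled: the model is shown a request together with its own generated answer, and the supervised target is an expert reasoning trajectory ending in a safety verdict. The field names, prompt template, and verdict labels are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class VerificationExample:
    prompt: str             # original (possibly adversarial) user request
    model_answer: str       # response sampled from the model being trained
    expert_trajectory: str  # step-by-step safety reasoning written by an expert
    verdict: str            # hypothetical label, e.g. "safe" or "unsafe"


def to_sft_pair(ex: VerificationExample) -> dict:
    """Format one example as an (input, target) pair for supervised fine-tuning.

    The input asks the model to critique its own answer; the target is the
    expert trajectory followed by the final verdict, so the model learns to
    reproduce the verification reasoning rather than only the label.
    """
    source = (
        "You are reviewing your own response for safety.\n"
        f"User request:\n{ex.prompt}\n\n"
        f"Your response:\n{ex.model_answer}\n\n"
        "Think step by step about whether the response violates the safety "
        "specification, then give a final verdict."
    )
    target = f"{ex.expert_trajectory}\nVerdict: {ex.verdict}"
    return {"input": source, "target": target}
```

Under this reading, the model never trains directly on producing answers to harmful prompts; it trains only on judging whether a given answer is safe, and the paper's claim is that this verification skill generalizes back to safer generation.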
Key facts
- arXiv:2605.08930v1
- Safety Internal (SInternal) framework proposed
- Trains LRMs on safety verification tasks
- Uses expert reasoning trajectories for self-critique
- Addresses vulnerability to adversarial jailbreaks
- Current alignment methods rely on external compliance
- Empirical analysis shows lack of intrinsic safety understanding
- Learning to verify induces strong generalization for response safety
Entities
Institutions
- arXiv