Safety Failures Found in Large Reasoning Models

ai-technology · 2026-05-09

A new study reveals that large reasoning models (LRMs) expose safety risks in their chain-of-thought reasoning traces, even when final answers appear safe. Researchers tested 15 open-weight and API-based LRMs across 41K prompts each, using a twenty-principle safety rubric. They identified 'leak cases' (unsafe reasoning with safe answers) and 'escape cases' (safe reasoning with unsafe answers). The study used prompts from seven harmfulness and jailbreak sources plus four out-of-distribution sources. Findings highlight that final-answer safety is insufficient as a proxy for overall safety.

Key facts

Large reasoning models expose safety risks in reasoning traces.
15 open-weight and API-based LRMs were evaluated.
41K prompts per model were used.
A twenty-principle safety rubric was applied.
Prompts came from seven harmfulness and jailbreak sources.
Four out-of-distribution sources were included.
Leak cases: unsafe reasoning, safe answer.
Escape cases: safe reasoning, unsafe answer.

Safety Failures Found in Large Reasoning Models

Key facts

Entities

Institutions

Sources