ARTFEED — Contemporary Art Intelligence

ReasoningGuard: Inference-Time Safety for Large Reasoning Models

ai-technology · 2026-05-07

A new method called ReasoningGuard aims to protect Large Reasoning Models (LRMs) from generating harmful content during their reasoning process. Unlike existing defenses, which require costly fine-tuning and expert knowledge, ReasoningGuard operates entirely at inference time: it injects safety-oriented reflections, termed 'safety aha moments', into the model's reasoning. The method uses the model's internal attention mechanisms to identify critical points in the reasoning path and triggers safety checks there. A scaling sampling strategy then selects the reasoning path on which both the intermediate steps and the final answer stay safe. The approach adds minimal inference cost and is designed to be scalable. The paper is available on arXiv under ID 2508.04204.
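
To make the injection step concrete, here is a minimal Python sketch written from the description above. Everything in it (the step_attention_score heuristic, the SAFETY_REFLECTION text, the toy trace) is a hypothetical stand-in, not the authors' implementation; in particular, the paper derives critical points from real attention patterns, which the keyword heuristic below only mimics.

    from dataclasses import dataclass, field

    SAFETY_REFLECTION = (
        "Wait, let me check whether this step could lead to harmful "
        "content. I should only continue if the reasoning stays safe."
    )

    @dataclass
    class ReasoningState:
        steps: list = field(default_factory=list)
        reflections: int = 0

    def step_attention_score(step):
        """Hypothetical proxy for the paper's attention-based signal.
        A keyword heuristic stands in for real attention statistics."""
        risky = ("bypass", "exploit", "weapon", "harm")
        return 1.0 if any(w in step.lower() for w in risky) else 0.0

    def decode_with_guard(model_steps, threshold=0.5):
        """Walk the reasoning steps; when the proxy signal crosses the
        threshold, inject a safety reflection (the 'safety aha moment')
        before decoding continues."""
        state = ReasoningState()
        for step in model_steps:
            state.steps.append(step)
            if step_attention_score(step) >= threshold:
                state.steps.append(SAFETY_REFLECTION)
                state.reflections += 1
        return state

    toy_trace = [
        "First, restate the user's question.",
        "One option is to bypass the content filter...",  # critical point
        "A safer option is to explain why the request is declined.",
    ]
    result = decode_with_guard(toy_trace)
    print(result.reflections, "reflection(s) injected")
    for s in result.steps:
        print("-", s)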

Key facts

  • ReasoningGuard is an inference-time safeguard for Large Reasoning Models (LRMs).
  • It injects timely 'safety aha moments' during reasoning to guide models toward harmless outputs.
  • The method leverages internal attention mechanisms to identify key points in reasoning.
  • A scaling sampling strategy selects the optimal reasoning path during decoding (sketched after this list).
  • Current defense methods rely on costly fine-tuning and expert knowledge.
  • LRMs remain vulnerable to harmful content generation, especially in mid-to-late reasoning steps.
  • ReasoningGuard adds minimal additional inference cost.
  • The paper is arXiv:2508.04204v2.
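
As referenced above, the scaling sampling step can be pictured as best-of-n selection over candidate reasoning paths, keeping the path a safety scorer rates highest. The sketch below is an assumption-laden illustration: sample_path and safety_score are invented placeholders, and the paper's actual sampler and scoring signal may differ.

    import random

    def sample_path(prompt, rng):
        """Stand-in for sampling one reasoning path from the model."""
        benign = ["Consider the request.", "Answer helpfully and safely."]
        risky = ["Consider the request.", "Speculate about causing harm."]
        return rng.choice([benign, benign, risky])

    def safety_score(path):
        """Toy scorer: penalize steps that look harmful. A real system
        could use a learned judge or the guard's own internal signals."""
        return -sum("harm" in step.lower() for step in path)

    def scaled_sampling(prompt, n=4, seed=0):
        """Best-of-n selection: sample n paths, keep the safest one."""
        rng = random.Random(seed)
        candidates = [sample_path(prompt, rng) for _ in range(n)]
        return max(candidates, key=safety_score)

    for step in scaled_sampling("example user prompt"):
        print("-", step)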

Entities

Institutions

  • arXiv
