Chain-of-Thought Disrupts Refusal Steering in Reasoning Models

ai-technology · 2026-05-27

A recent study posted on arXiv (2605.26772) reports that chain-of-thought (CoT) reasoning makes refusal harder to control in large reasoning models (LRMs). On DeepSeek-R1-Distill-Llama-8B, activation steering reverses refusal in only 39% of cases when the CoT is held fixed; removing the CoT raises this rate to 70%. A two-stage intervention that lets the model regenerate its CoT under steering reverses 94% of refusals. The CoT produced this way retains 48% of the effect on its own once steering is removed, suggesting that the CoT can independently carry a compliance signal. In contrast to instruction-tuned LLMs, where refusal is managed by a single directional subspace, LRMs encode refusal jointly in the CoT and the model activations.
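For readers unfamiliar with activation steering, the sketch below illustrates the general idea in plain Python. It is not the paper's implementation: it assumes the commonly used difference-of-means recipe, in which a "refusal direction" is estimated from activations on refused versus complied prompts and then subtracted or projected out of a hidden state. The function names (`refusal_direction`, `steer`, `ablate`) and the toy two-dimensional vectors are hypothetical and chosen only for illustration.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(refused_acts, complied_acts):
    """Unit vector from the complied mean to the refused mean.

    Assumption: difference-of-means is used here only as a stand-in;
    the summarized paper may estimate its direction differently.
    """
    diff = [r - c for r, c in zip(mean_vector(refused_acts),
                                  mean_vector(complied_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0:
        raise ValueError("refused and complied means are identical")
    return [x / norm for x in diff]

def steer(activation, direction, alpha):
    """Add alpha * direction; a negative alpha pushes away from refusal."""
    return [a + alpha * d for a, d in zip(activation, direction)]

def ablate(activation, direction):
    """Remove the component of the activation along the unit direction."""
    proj = sum(a * d for a, d in zip(activation, direction))
    return [a - proj * d for a, d in zip(activation, direction)]

# Toy data: refusal shows up only along the first coordinate.
refused = [[2.0, 0.0], [4.0, 0.0]]
complied = [[0.0, 0.0], [0.0, 0.0]]
direction = refusal_direction(refused, complied)

hidden = [5.0, 2.0]
steered = steer(hidden, direction, alpha=-1.5)
ablated = ablate(hidden, direction)
```

In a real model the same operation would be applied to residual-stream activations at chosen layers during generation. The fixed-CoT and regenerated-CoT conditions described above differ in whether the reasoning tokens are produced before or while this intervention is active.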

Key facts

arXiv paper 2605.26772
DeepSeek-R1-Distill-Llama-8B model
Activation steering reverses refusal in 39% of cases with fixed CoT
Removing CoT increases refusal reversal to 70%
Two-stage intervention achieves 94% refusal reversal
Resulting CoT alone retains 48% of the effect after steering is removed
CoT can independently carry compliance signals
Refusal in LRMs is jointly encoded in CoT and activations
