Chain-of-Thought Disrupts Refusal Steering in Reasoning Models
A recent study published on arXiv (2605.26772) indicates that chain-of-thought (CoT) reasoning makes refusal harder to control in large reasoning models (LRMs). On DeepSeek-R1-Distill-LLaMA-8B, activation steering reverses refusal in only 39% of cases when the CoT is held fixed; removing the CoT raises this rate to 70%. A two-stage intervention that lets the model regenerate its CoT under steering reverses 94% of refusals. The CoT produced this way retains 48% of the effect even after steering is removed, suggesting that the CoT can carry a compliance signal on its own. Unlike instruction-tuned LLMs, where refusal is mediated by a single direction in activation space, LRMs encode refusal jointly in the CoT and the model activations.
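For readers unfamiliar with activation steering, the following is a minimal sketch of one common variant: estimating a "refusal direction" as a difference of mean activations and then projecting it out. This summary does not describe the paper's exact procedure, so the difference-of-means estimate, the function names, and the synthetic activations below are all illustrative assumptions rather than the authors' method.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations (assumed estimator)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate_direction(acts, direction):
    """Remove each activation's component along the unit vector `direction`."""
    coeffs = acts @ direction              # projection coefficient per row
    return acts - np.outer(coeffs, direction)

# Synthetic stand-in for residual-stream activations (hypothetical data):
# "harmful" prompts are shifted along one axis relative to "harmless" ones.
rng = np.random.default_rng(0)
d_model = 16
true_dir = np.zeros(d_model)
true_dir[0] = 1.0
harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 4.0 * true_dir

r_hat = refusal_direction(harmful, harmless)
steered = ablate_direction(harmful, r_hat)

# After ablation, the activations have no component left along r_hat.
print(np.abs(steered @ r_hat).max())
```

In a real model the same projection would be applied to hidden states during the forward pass. The study's finding is that this kind of intervention alone is much less effective when a refusal-oriented CoT is already fixed in the context.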
Key facts
- arXiv paper 2605.26772
- DeepSeek-R1-Distill-LLaMA-8B model
- Activation steering reverses refusal in 39% of cases with fixed CoT
- Removing CoT increases refusal reversal to 70%
- Two-stage intervention achieves 94% refusal reversal
- The CoT produced under steering retains 48% of the effect on its own once steering is removed
- CoT can independently carry compliance signals
- Refusal in LRMs is jointly encoded in CoT and activations
Entities
Institutions
- arXiv