ARTFEED — Contemporary Art Intelligence

Chain-of-Thought Disrupts Refusal Steering in Reasoning Models

ai-technology · 2026-05-27

A recent study posted to arXiv (2605.26772) reports that chain-of-thought (CoT) reasoning makes refusal harder to control in large reasoning models (LRMs). On DeepSeek-R1-Distill-LLaMA-8B, activation steering reverses refusal in only 39% of cases when the CoT is held fixed; removing the CoT raises that rate to 70%. A two-stage intervention that lets the model regenerate its CoT under steering reverses 94% of refusals. The CoT produced this way retains 48% of the effect on its own after steering is removed, suggesting that CoT can carry a compliance signal independently of the steered activations. Unlike instruction-tuned LLMs, where refusal is governed by a single directional subspace, LRMs encode refusal jointly in the CoT and the model's activations.
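For readers unfamiliar with activation steering, the idea can be sketched on toy vectors. The example below is a minimal illustration, not the paper's procedure: it assumes the steering direction is estimated as a difference of mean activations between refused and complied prompts (a common choice in the steering literature), and every name in it (`refusal_direction`, `steer`, `ablate`, the synthetic data) is hypothetical. Real experiments would read and modify hidden states inside the model rather than use random vectors.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_vector(rows):
    """Component-wise mean of a list of vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(refused, complied):
    """Unit vector pointing from the mean 'complied' activation
    to the mean 'refused' activation (difference of means)."""
    diff = [r - c for r, c in zip(mean_vector(refused), mean_vector(complied))]
    norm = math.sqrt(dot(diff, diff))
    return [x / norm for x in diff]


def steer(hidden, direction, alpha):
    """Shift a hidden state by alpha along the direction.
    A negative alpha pushes the state away from refusal."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def ablate(hidden, direction):
    """Remove the hidden state's component along the direction."""
    return steer(hidden, direction, -dot(hidden, direction))


# Synthetic stand-in for model activations (assumption): refusal is
# simulated as a shift of 4.0 along the first coordinate.
random.seed(0)
DIM = 8


def sample(shift):
    v = [random.gauss(0.0, 1.0) for _ in range(DIM)]
    v[0] += shift
    return v


complied = [sample(0.0) for _ in range(200)]
refused = [sample(4.0) for _ in range(200)]

direction = refusal_direction(refused, complied)
ablated = ablate(refused[0], direction)
```

In this toy setting a single direction captures the whole difference between the two groups, which corresponds to the instruction-tuned case described above. The study's claim is that in LRMs such an activation-level edit is not sufficient on its own, because the CoT text also carries part of the refusal signal.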

Key facts

  • arXiv paper 2605.26772
  • DeepSeek-R1-Distill-LLaMA-8B model
  • Activation steering reverses refusal in 39% of cases with fixed CoT
  • Removing CoT increases refusal reversal to 70%
  • Two-stage intervention achieves 94% refusal reversal
  • The regenerated CoT alone retains 48% of the effect after steering is removed
  • CoT can independently carry compliance signals
  • Refusal in LRMs is jointly encoded in CoT and activations

Entities

Institutions

  • arXiv
