LLM Refusal Behavior Detectable in Intermediate Activations

ai-technology · 2026-05-28

A recent arXiv paper (2605.28553) reports that refusal behavior in large language models can be predicted from intermediate activations before any output is decoded. The authors apply linear probes to the residual stream activations at each transformer block and find that refusal is linearly decodable well before the final layer, which suggests that safety-related representations form before output generation. Building on this, they propose Mechanistic AutoDAN, a probe-guided variant of AutoDAN that replaces full-model fitness evaluation with partial forward passes and probe-based scoring inside a genetic prompt search loop. The method achieves attack success rates competitive with vanilla AutoDAN while reducing per-iteration search time by up to 72%. Probe-guided prompts also match or exceed AutoDAN's cross-model transfer in several configurations, and probe guidance becomes more useful as model size and safety alignment strength increase.
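To make the idea of layer-wise linear probing concrete, here is a minimal sketch in plain Python. It does not use a real model or the paper's code: the "activations" are synthetic vectors, the per-layer signal strengths are made-up values chosen so that the refusal label becomes easier to read in later layers, and all function names are hypothetical. It only illustrates what "linearly decodable at layer k" means: a logistic-regression probe trained on that layer's vectors classifies held-out examples well above chance.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_layer_activations(rng, labels, dim, signal):
    """Synthetic stand-in for residual-stream activations at one block.

    Assumption: the refusal label shifts the vector along axis 0 by +/- signal.
    """
    acts = []
    for y in labels:
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] += signal if y == 1 else -signal
        acts.append(vec)
    return acts


def train_linear_probe(acts, labels, epochs=200, lr=0.1):
    """Fit logistic regression (weights w, bias b) with batch gradient descent."""
    dim, n = len(acts[0]), len(acts)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(acts, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, acts, labels):
    """Fraction of examples where the probe's sign matches the label."""
    correct = 0
    for x, y in zip(acts, labels):
        predicted = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += predicted == y
    return correct / len(labels)


def layerwise_probe_accuracy(seed=0, n=200, dim=8, signals=(0.0, 0.5, 1.5, 3.0)):
    """Train one probe per synthetic 'layer' and return held-out accuracies."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n)]  # 1 = refuse, 0 = comply (balanced)
    accuracies = []
    for signal in signals:  # assumed signal strength per layer, not measured
        train = make_layer_activations(rng, labels, dim, signal)
        test = make_layer_activations(rng, labels, dim, signal)
        w, b = train_linear_probe(train, labels)
        accuracies.append(probe_accuracy(w, b, test, labels))
    return accuracies


accuracies = layerwise_probe_accuracy()
for layer, acc in enumerate(accuracies):
    print(f"layer {layer}: held-out probe accuracy = {acc:.2f}")
```

In a real setting the synthetic vectors would be replaced by hidden states captured at each transformer block, and the earliest layer at which held-out accuracy rises well above chance indicates how early refusal becomes linearly readable.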

Key facts

Refusal behavior is linearly decodable from intermediate LLM activations before the final layer.
Mechanistic AutoDAN uses probe-guided scoring to replace full-model fitness evaluation.
Attack success rates are competitive with vanilla AutoDAN.
Per-iteration search time is reduced by up to 72%.
Probe-guided prompts match or exceed AutoDAN's cross-model transfer.
Probe guidance usefulness increases with model size and safety alignment strength.
Research is reported in arXiv paper 2605.28553.
Method uses linear probes on residual stream activations at each transformer block.
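
The substitution described in the key facts, scoring candidates with a cheap probe on a partial forward pass instead of a full-model evaluation, can be sketched as a generic genetic loop with a pluggable fitness function. This is a toy under stated assumptions, not the paper's implementation: candidates are lists of integer token ids, `partial_forward` is a hypothetical stub that fabricates a small hidden vector without any model, and `PROBE_W` is an invented probe weight vector. Selection, crossover, and mutation are the textbook versions.

```python
import random

VOCAB_SIZE = 50   # assumed toy vocabulary
PROMPT_LEN = 12   # assumed candidate length in tokens
HIDDEN_DIM = 6    # assumed size of the stub hidden vector

# Invented probe weights; a real probe would be learned from activations.
PROBE_W = [0.9, -0.4, 0.3, -0.8, 0.5, -0.2]


def partial_forward(tokens):
    """Hypothetical stub for running only the first few blocks of a model."""
    hidden = [0.0] * HIDDEN_DIM
    for pos, tok in enumerate(tokens):
        hidden[(tok + pos) % HIDDEN_DIM] += (tok % 7) / 7.0
    return hidden


def probe_refusal_score(tokens):
    """Linear probe readout on the stub hidden vector (higher = more refusal)."""
    hidden = partial_forward(tokens)
    return sum(w * h for w, h in zip(PROBE_W, hidden))


def fitness(tokens):
    # Fitter candidates are those the probe scores as less likely to be refused.
    return -probe_refusal_score(tokens)


def crossover(rng, a, b):
    """Single-point crossover of two equal-length candidates."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(rng, tokens, rate=0.1):
    """Resample each token independently with probability `rate`."""
    return [rng.randrange(VOCAB_SIZE) if rng.random() < rate else t for t in tokens]


def genetic_search(seed=0, pop_size=20, generations=30, elite=4):
    """Elitist genetic loop; returns the best candidate and best-fitness history."""
    rng = random.Random(seed)
    population = [
        [rng.randrange(VOCAB_SIZE) for _ in range(PROMPT_LEN)]
        for _ in range(pop_size)
    ]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[:elite]  # elites survive unchanged
        children = []
        while len(children) < pop_size - elite:
            a, b = rng.sample(parents, 2)
            children.append(mutate(rng, crossover(rng, a, b)))
        population = parents + children
    population.sort(key=fitness, reverse=True)
    history.append(fitness(population[0]))
    return population[0], history


best, history = genetic_search()
print(f"best fitness: first generation {history[0]:.3f}, last {history[-1]:.3f}")
```

The design point is that `fitness` is the only place the model is touched, so replacing a full forward pass plus generation with a truncated pass plus a dot product changes the per-candidate cost without altering the search loop. Because the elites are carried over unchanged, the best fitness in this sketch never decreases between generations.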
