Latent-Space Attacks Evade Refusal in Language Models

ai-technology · 2026-05-23

A new paper on arXiv reframes refusal suppression in safety-aligned language models as a latent-space evasion attack against linear probes. The authors show that the difference-in-means direction used in prior work defines a linear probe, and that ablating this direction projects activations onto the probe's decision boundary, which makes the ablation a minimum-confidence evasion attack. This explains why ablation succeeds empirically, but it also reveals a limitation: the evasion stops at the boundary, which motivates further research. The study thereby provides a principled account of the transformation that refusal suppression methods induce in latent space.

Key facts

arXiv:2605.21706v1
Safety-aligned language models refuse harmful requests
Refusal behavior can be suppressed by steering internal representations
Existing methods ablate a refusal direction from model activations
Prior work lacks a principled account of the latent-space transformation these methods induce
The paper recasts refusal suppression as a latent-space evasion attack against linear probes
The difference-in-means direction defines a probe
Ablation is a projection onto the probe's decision boundary
This makes ablation a minimum-confidence evasion attack
Limitation: the evasion stops at the decision boundary
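
The geometric claim in the facts above can be illustrated with a small sketch. It is not the paper's code. It assumes a bias-free linear probe whose score is the dot product of an activation with the unit difference-in-means direction, so the decision boundary is the hyperplane where that score is zero. The toy activations and the helper names (diff_in_means_direction, probe_score, ablate) are hypothetical and chosen only for illustration.

```python
# Minimal sketch, assuming a bias-free probe: score(x) = r_hat . x,
# with decision boundary {x : r_hat . x = 0}. Toy data, not model activations.

def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def diff_in_means_direction(harmful, harmless):
    """Unit vector pointing from the harmless mean to the harmful mean."""
    r = [h - b for h, b in zip(mean(harmful), mean(harmless))]
    norm = dot(r, r) ** 0.5
    return [x / norm for x in r]

def probe_score(x, r_hat):
    """Linear probe induced by the direction (positive = 'refuse' side)."""
    return dot(x, r_hat)

def ablate(x, r_hat):
    """Directional ablation: remove the component of x along r_hat."""
    s = dot(x, r_hat)
    return [xi - s * ri for xi, ri in zip(x, r_hat)]

# Hypothetical activations for harmful and harmless prompts.
harmful = [[2.0, 1.0, 3.0], [4.0, -1.0, 1.0]]
harmless = [[-1.0, 0.5, 2.5], [-1.0, -0.5, 1.5]]

r_hat = diff_in_means_direction(harmful, harmless)
x = harmful[0]
x_ablated = ablate(x, r_hat)

score_before = probe_score(x, r_hat)
score_after = probe_score(x_ablated, r_hat)

# Size of the perturbation that ablation applied to x.
delta = [a - b for a, b in zip(x_ablated, x)]
delta_norm = dot(delta, delta) ** 0.5

print("direction:", r_hat)
print("score before ablation:", score_before)
print("score after ablation:", score_after)
print("perturbation norm:", delta_norm)
```

Under these assumptions the ablated activation scores exactly zero, so it sits on the boundary rather than on the harmless side, and the perturbation norm equals the original distance to the boundary. That is the sense in which the attack is minimum-confidence and in which the evasion stops at the boundary.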
