Latent-Space Attacks Evade Refusal in Language Models
A new arXiv paper reframes refusal suppression in safety-aligned language models as a latent-space evasion attack against linear probes. The authors show that the difference-in-means direction used in prior work defines a linear probe, and that ablating this direction projects activations onto the probe's decision boundary, which amounts to a minimum-confidence evasion attack. This explains why ablation succeeds empirically, but it also exposes a limitation: the evasion stops at the boundary, which motivates further research. The study thereby offers a principled account of the transformation that refusal-suppression methods induce.
Key facts
- arXiv:2605.21706v1
- Safety-aligned language models refuse harmful requests
- Refusal behavior can be suppressed by steering internal representations
- Existing methods ablate a refusal direction from model activations
- Prior work lacks a principled account of the latent-space transformation these methods induce
- The paper recasts refusal suppression as a latent-space evasion attack against linear probes
- The difference-in-means direction defines a linear probe
- Ablating that direction projects activations onto the probe's decision boundary
- This projection is a minimum-confidence evasion attack
- Limitation: the evasion stops at the decision boundary
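The geometric claim in the key facts can be illustrated with a small NumPy sketch. This is not the paper's code: the activations are synthetic Gaussian data, the function names are hypothetical, and the probe is assumed to have zero bias, so its decision boundary is the hyperplane through the origin orthogonal to the difference-in-means direction. Under those assumptions, removing the component along that direction gives the same result as the smallest L2 shift that moves an activation onto the boundary.

```python
import numpy as np


def difference_in_means(harmful, harmless):
    """Difference between the mean activations of the two prompt sets."""
    return harmful.mean(axis=0) - harmless.mean(axis=0)


def probe_score(x, w, b=0.0):
    """Linear probe score; positive values fall on the 'refuse' side."""
    return x @ w + b


def project_to_boundary(x, w, b=0.0):
    """Smallest L2 shift that moves each row of x onto w.x + b = 0."""
    return x - np.outer(probe_score(x, w, b), w) / (w @ w)


def ablate_direction(x, w):
    """Directional ablation: remove the component of each row along w."""
    unit = w / np.linalg.norm(w)
    return x - np.outer(x @ unit, unit)


# Synthetic stand-ins for model activations (an assumption for illustration):
# the "harmful" set is shifted along the first coordinate.
rng = np.random.default_rng(0)
dim = 16
shift = np.zeros(dim)
shift[0] = 3.0
harmful = rng.normal(size=(200, dim)) + shift
harmless = rng.normal(size=(200, dim))

w = difference_in_means(harmful, harmless)
ablated = ablate_direction(harmful, w)
projected = project_to_boundary(harmful, w, b=0.0)  # zero bias is assumed

# With zero bias the two transformations coincide, and every ablated
# activation scores (numerically) zero: on the boundary, not past it.
print("ablation matches projection:", np.allclose(ablated, projected))
print("max |score| after ablation:", np.abs(probe_score(ablated, w)).max())
```

In this sketch the ablated activations sit exactly on the probe's boundary, up to floating-point error, and never on the opposite side. That is the limitation the summary describes: the evasion stops at the decision boundary.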
Entities
Institutions
- arXiv