ARTFEED — Contemporary Art Intelligence

Latent-Space Attacks Evade Refusal in Language Models

ai-technology · 2026-05-23

A new paper on arXiv reframes refusal suppression in safety-aligned language models as a latent-space evasion attack against linear probes. The authors show that the difference-in-means direction used in prior work defines a linear probe, and that ablating this direction projects the activation onto the probe's decision boundary, which makes the ablation a minimum-confidence evasion attack. This accounts for the method's empirical success but also exposes a limitation: the evasion stops at the boundary rather than crossing it, which the authors present as motivation for further research. The study offers a principled account of the transformation that refusal suppression methods induce.
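
The probe-and-projection argument can be illustrated with a minimal NumPy sketch. It is not the paper's code: the activations are synthetic stand-ins, the helper names (`refusal_direction`, `probe_logit`, `ablate`) are hypothetical, and the probe is assumed to be bias-free so that its decision boundary passes through the origin.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hypothetical hidden size

# Synthetic stand-ins for model activations (assumption: real activations
# would be read from a model's residual stream on harmful/harmless prompts).
shift = 2.0 * np.eye(d)[0]
harmful = rng.normal(size=(200, d)) + shift
harmless = rng.normal(size=(200, d)) - shift


def refusal_direction(harmful, harmless):
    """Unit-norm difference-in-means direction between the two classes."""
    r = harmful.mean(axis=0) - harmless.mean(axis=0)
    return r / np.linalg.norm(r)


def probe_logit(x, r_hat):
    """Bias-free linear probe (an assumption): logit > 0 reads as 'refuse'."""
    return x @ r_hat


def ablate(x, r_hat):
    """Directional ablation: remove each row's component along r_hat."""
    return x - np.outer(x @ r_hat, r_hat)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


r_hat = refusal_direction(harmful, harmless)
logits_before = probe_logit(harmful, r_hat)
logits_after = probe_logit(ablate(harmful, r_hat), r_hat)
probs_after = sigmoid(logits_after)

print("mean logit before ablation:", logits_before.mean())
print("max |logit| after ablation:", np.abs(logits_after).max())
```

Under these assumptions every ablated activation has a probe logit of zero up to floating-point error, so the probe's output probability sits at 0.5: the activation lies on the decision boundary and is not pushed into the region the probe would label harmless.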

Key facts

  • arXiv:2605.21706v1
  • Safety-aligned language models refuse harmful requests
  • Refusal behavior can be suppressed by steering internal representations
  • Existing methods ablate a refusal direction from model activations
  • Prior work lacks a principled account of the latent-space transformation these methods induce
  • The paper recasts refusal suppression as a latent-space evasion attack against linear probes
  • The difference-in-means direction defines a linear probe
  • Ablating that direction is a projection onto the probe's decision boundary
  • The projection amounts to a minimum-confidence evasion attack
  • Limitation: evasion stops at decision boundary
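
One way to read the projection and minimum-confidence facts above is as a small constrained optimization. The notation here is an assumption for illustration, not taken from the paper: $x$ is an activation, $\hat{r}$ is the unit-norm difference-in-means direction, and the probe is taken to be bias-free, $f(x) = \hat{r}^\top x$, with output probability $\sigma(f(x))$.

```latex
% Smallest perturbation that moves x onto the boundary f(x) = 0:
\delta^\star
  = \arg\min_{\delta} \ \lVert \delta \rVert_2
  \quad \text{s.t.} \quad \hat{r}^\top (x + \delta) = 0
  \;\;\Longrightarrow\;\;
  \delta^\star = -(\hat{r}^\top x)\,\hat{r}

% The perturbed activation is exactly the ablated one:
x + \delta^\star = \left(I - \hat{r}\hat{r}^\top\right) x

% The probe is maximally uncertain there:
f(x + \delta^\star) = 0,
\qquad
\sigma\!\left(f(x + \delta^\star)\right) = \tfrac{1}{2}
```

Under this reading, ablation is the minimum-norm move that reaches the boundary, and it goes no further: a perturbation of the form $-(\hat{r}^\top x + c)\,\hat{r}$ with $c > 0$ would be needed to make the logit negative.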

Entities

Institutions

  • arXiv
