ARTFEED — Contemporary Art Intelligence

Research Shows Harmful Intent Detectable as Geometric Feature in LLM Residual Streams

ai-technology · 2026-04-22

A study demonstrates that harmful intent can be identified geometrically within large language model residual streams, appearing as a linear direction in most layers and as an angular deviation where projection methods fail. The research examined 12 models across four architectural families (Qwen2.5, Qwen3.5, Llama-3.2, and Gemma-3) in three alignment variants: base, instruction-tuned, and abliterated. Six direction-finding strategies were tested under single-turn English evaluation.

Three methods proved successful. A soft-AUC-optimized linear direction achieved a mean AUROC of 0.98 and a TPR@1%FPR of 0.80; a class-mean probe reached AUROC 0.98 and TPR 0.71 with fitting costs under 1 ms; and a supervised angular-deviation strategy attained AUROC 0.96 and TPR 0.61 along a representationally distinct direction, oriented 73 degrees away from the projection-based solutions. The angular-deviation approach was unique in maintaining detection in middle layers where projection methods collapsed.

Detection remained stable across all alignment variants, including abliterated models. The findings characterize the geometry of harmful intent through multiple analytical strategies, revealing distinct directional patterns in model representations.
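To illustrate the cheapest of these detectors: a class-mean probe takes the difference between harmful and benign activation centroids as its direction, then scores new activations by projection onto it. The sketch below uses synthetic 64-dimensional vectors in place of real residual-stream activations; the dimensionality, separation strength, and sample counts are invented for the demo and are not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hypothetical residual-stream width; real models use thousands of dims

# Synthetic stand-ins for layer activations: harmful examples are shifted
# along an unknown "true" direction relative to benign ones.
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
benign = rng.normal(size=(200, d))
harmful = rng.normal(size=(200, d)) + 2.0 * true_dir

# Class-mean ("difference-of-means") probe: the direction is just the gap
# between class centroids, so fitting is a couple of vector operations.
direction = harmful.mean(axis=0) - benign.mean(axis=0)
direction /= np.linalg.norm(direction)

# Score every activation by its projection onto the probe direction.
scores = np.concatenate([benign, harmful]) @ direction
labels = np.concatenate([np.zeros(len(benign)), np.ones(len(harmful))])

# AUROC via the Mann-Whitney rank-sum identity (no sklearn required).
order = scores.argsort()
ranks = np.empty(len(scores))
ranks[order] = np.arange(1, len(scores) + 1)
n_pos = labels.sum()
n_neg = len(labels) - n_pos
auroc = (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
print(f"AUROC = {auroc:.3f}")
```

Because fitting reduces to two means and a subtraction, the sub-millisecond fitting cost the study reports is consistent with this probe family.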

Key facts

  • Harmful intent is geometrically recoverable from LLM residual streams
  • Study examined 12 models across four architectural families: Qwen2.5, Qwen3.5, Llama-3.2, Gemma-3
  • Three alignment variants tested: base, instruction-tuned, abliterated
  • Soft-AUC-optimized linear direction achieved mean AUROC 0.98 and TPR@1%FPR 0.80
  • Class-mean probe reached AUROC 0.98 and TPR 0.71 with <1ms fitting cost
  • Supervised angular-deviation strategy achieved AUROC 0.96 and TPR 0.61 along a direction 73° from projection-based solutions
  • Angular-deviation method sustained detection in middle layers where projection methods collapsed
  • Detection remained stable across all alignment variants including abliterated models
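The angular-deviation idea can likewise be sketched in a toy geometry: instead of projecting onto a direction, each activation is scored by the angle it makes with a reference direction fitted on benign examples. Everything below (dimensions, rotation angle, noise scale, and the TPR@1%FPR computation) is an invented illustration, not the paper's method or data.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64  # hypothetical residual-stream width

# Toy geometry: harmful activations sit on an axis rotated away from the
# benign axis, so magnitude-based projection alone would miss them.
benign_axis = rng.normal(size=d)
benign_axis /= np.linalg.norm(benign_axis)
off_axis = rng.normal(size=d)
off_axis -= (off_axis @ benign_axis) * benign_axis  # orthogonalize
off_axis /= np.linalg.norm(off_axis)

rot = 0.6  # invented rotation (radians) between the two axes
harm_axis = np.cos(rot) * benign_axis + np.sin(rot) * off_axis
benign = 3.0 * benign_axis + 0.1 * rng.normal(size=(200, d))
harmful = 3.0 * harm_axis + 0.1 * rng.normal(size=(200, d))

def angles_to(ref, X):
    """Angle (radians) between each row of X and the unit reference vector."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return np.arccos(np.clip(Xn @ ref, -1.0, 1.0))

# Supervised reference: the mean benign activation. The harmfulness score
# is the angular deviation from it (larger angle = more suspicious).
ref = benign.mean(axis=0)
ref /= np.linalg.norm(ref)
scores = np.concatenate([angles_to(ref, benign), angles_to(ref, harmful)])
labels = np.concatenate([np.zeros(200), np.ones(200)])

# TPR@1%FPR: set the threshold at the 99th percentile of benign scores,
# then measure the fraction of harmful prompts still flagged.
threshold = np.quantile(scores[labels == 0], 0.99)
tpr_at_1pct_fpr = (scores[labels == 1] > threshold).mean()
print(f"TPR@1%FPR = {tpr_at_1pct_fpr:.2f}")
```

The design point this mirrors from the study is that an angle-based score depends only on orientation, not on projection magnitude, which is one way a detector could keep working in layers where projection methods collapse.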

Entities

Institutions

  • arXiv
