ARTFEED — Contemporary Art Intelligence

Probe-Based Deception Detection in LLMs Fails Under Stylistic Shift

ai-technology · 2026-05-28

A recent arXiv preprint (2605.27958) evaluates linear probes trained on LLM activations as deception detectors across the Gemma 3 model family (1B to 27B parameters). The probes reach near-perfect AUROC (≥0.998) on clean benchmarks but collapse under stylistic distribution shift; the study evaluates eight such shifts. The authors test four hypotheses about how deception is encoded: a single linear direction, a multi-dimensional subspace, a convex conic hull, and an entropy proxy. Their analysis combines cross-domain transfer matrices, multi-dimensional probe analysis, permutation null baselines, entropy residualization, and distractor evaluations. Beyond diagnosing why the probes fail, they show that style-augmented probes restore near-perfect detection on previously unseen styles (mean AUROC 0.979–0.983).
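To make the failure mode concrete, here is a minimal toy sketch in Python. It is not the paper's setup: the synthetic "activations", the difference-of-means probe, and the helper names (make_activations, fit_probe, confounded, varied, unseen) are all assumptions chosen for illustration. It shows one plausible mechanism, a probe that absorbs a style feature correlated with the label in clean data, and how training across varied styles can reduce that dependence.

```python
import numpy as np

def auroc(scores, labels):
    """Rank-based AUROC (Mann-Whitney U); assumes no tied scores."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def make_activations(rng, n, style_fn, dim=32, noise=0.2):
    """Toy 'activations': axis 0 carries the label, axis 1 carries style."""
    labels = rng.integers(0, 2, size=n)
    x = rng.normal(0.0, noise, size=(n, dim))
    x[:, 0] += labels
    x[:, 1] += style_fn(rng, labels)
    return x, labels

def fit_probe(x, labels):
    """Difference-of-means linear probe: a single direction in activation space."""
    return x[labels == 1].mean(axis=0) - x[labels == 0].mean(axis=0)

# Hypothetical style conditions (assumptions, not taken from the paper).
def confounded(rng, y):
    return 3.0 * y                                # style tracks the label

def varied(rng, y):
    return rng.uniform(-3.0, 3.0, size=len(y))    # many styles, label-independent

def unseen(rng, y):
    return rng.normal(0.0, 2.0, size=len(y))      # held-out style distribution

rng = np.random.default_rng(0)
x_clean, y_clean = make_activations(rng, 4000, confounded)
x_aug, y_aug = make_activations(rng, 4000, varied)
x_test_clean, y_test_clean = make_activations(rng, 2000, confounded)
x_test_shift, y_test_shift = make_activations(rng, 2000, unseen)

w_plain = fit_probe(x_clean, y_clean)   # trained on clean, style-confounded data
w_aug = fit_probe(x_aug, y_aug)         # trained on style-augmented data

clean_auc = auroc(x_test_clean @ w_plain, y_test_clean)
shifted_auc = auroc(x_test_shift @ w_plain, y_test_shift)
augmented_auc = auroc(x_test_shift @ w_aug, y_test_shift)
print(f"plain probe, clean test:   {clean_auc:.3f}")
print(f"plain probe, shifted test: {shifted_auc:.3f}")
print(f"augmented probe, shifted:  {augmented_auc:.3f}")
```

In this toy, the plain probe leans heavily on the style axis because style predicts the label in its training data, so it degrades once style varies independently. Whether the Gemma 3 probes fail for the same reason is a question for the paper itself, not something this sketch establishes.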

Key facts

  • Paper: arXiv 2605.27958
  • Tests Gemma 3 models from 1B to 27B parameters
  • Probes achieve AUROC ≥0.998 on clean data
  • Probes collapse under stylistic shift
  • Four hypotheses tested: single linear direction, multi-dimensional subspace, convex conic hull, entropy proxy
  • Uses cross-domain transfer matrices, permutation null baselines, entropy-residualization, distractor evaluations
  • Eight stylistic shifts evaluated
  • Style-augmented probes achieve mean AUROC 0.979–0.983 on unseen styles
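Two of the diagnostics named in the list, cross-domain transfer matrices and permutation null baselines, can be sketched generically in Python. This is a hedged illustration on synthetic data: the domain names "A", "B", "C", the helper names, and the difference-of-means probe are assumptions, and the paper's exact protocol may differ.

```python
import numpy as np

def auroc(scores, labels):
    """Rank-based AUROC (Mann-Whitney U); assumes no tied scores."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def fit_probe(x, labels):
    """Difference-of-means linear probe."""
    return x[labels == 1].mean(axis=0) - x[labels == 0].mean(axis=0)

def make_domain(rng, direction, n, signal=2.0, noise=0.5):
    """Synthetic domain whose label is encoded along `direction`."""
    labels = rng.integers(0, 2, size=n)
    x = rng.normal(0.0, noise, size=(n, len(direction)))
    x += signal * labels[:, None] * direction
    return x, labels

def transfer_matrix(domains):
    """Entry [i, j]: AUROC of a probe fit on domain i and scored on domain j."""
    names = list(domains)
    m = np.zeros((len(names), len(names)))
    for i, src in enumerate(names):
        w = fit_probe(*domains[src]["train"])
        for j, dst in enumerate(names):
            x, y = domains[dst]["test"]
            m[i, j] = auroc(x @ w, y)
    return names, m

def permutation_null(x_train, y_train, x_test, y_test, rng, n_perm=200):
    """AUROCs of probes fit on shuffled training labels (a chance reference)."""
    out = []
    for _ in range(n_perm):
        w = fit_probe(x_train, rng.permutation(y_train))
        out.append(auroc(x_test @ w, y_test))
    return np.array(out)

rng = np.random.default_rng(0)
dim = 64
e = np.eye(dim)
directions = {
    "A": e[0],                          # reference encoding
    "B": (e[0] + e[1]) / np.sqrt(2),    # partially shared with A
    "C": e[2],                          # orthogonal to A
}
domains = {
    name: {"train": make_domain(rng, d, 2000), "test": make_domain(rng, d, 1000)}
    for name, d in directions.items()
}

names, m = transfer_matrix(domains)
null = permutation_null(*domains["A"]["train"], *domains["A"]["test"], rng)
observed = m[0, 0]
print(names)
print(np.round(m, 3))
print("observed:", round(observed, 3), "null 95th pct:", round(np.quantile(null, 0.95), 3))
```

Off-diagonal entries near the diagonal suggest a shared encoding across domains, while entries near 0.5 suggest the probe direction does not transfer. The permutation null gives a reference distribution for what a probe with no real label information would score on the same data.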

Entities

Institutions

  • arXiv