Probe-Based Deception Detection in LLMs Fails Under Stylistic Shift

ai-technology · 2026-05-28

A recent study on arXiv (2605.27958) evaluates linear probes trained on LLM activations as deception detectors across the Gemma 3 model family (1B-27B parameters). The probes reach near-perfect AUROC (≥0.998) on clean benchmarks but collapse under stylistic shift. The study tests four hypotheses about how deception is encoded: a single linear direction, a multi-dimensional subspace, a convex conic hull, and an entropy proxy. Using methods that include cross-domain transfer matrices, multi-dimensional probe analysis, permutation null baselines, entropy-residualization, and distractor evaluations across eight stylistic shifts, the authors diagnose why the probes fail. They also show that style-augmented probes reach a mean AUROC of 0.979–0.983 on previously unseen styles.
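To make the failure mode concrete, here is a minimal sketch on synthetic data. It is not the paper's setup. It assumes a toy 8-dimensional "activation" in which axis 0 carries the deception label and axis 1 carries a style feature, and it uses a difference-of-means probe as a stand-in for whatever probe training the authors use. Style augmentation is simplified to "style varies independently of the label in the training data". All names (`make_activations`, `fit_mean_difference_probe`, the `confounded` and `random` style modes) and all magnitudes are hypothetical.

```python
import random

DIM = 8                               # toy width; real activations are far wider
SIGNAL, STYLE, NOISE = 1.0, 2.0, 0.5  # assumed magnitudes, chosen for illustration


def make_activations(n, style_mode, rng):
    """Return (labels, vectors) of synthetic activations."""
    labels, vectors = [], []
    for i in range(n):
        y = i % 2                     # balanced classes: 1 = deceptive, 0 = honest
        c = 1.0 if y == 1 else -1.0
        # "confounded": style tracks the label. "random": style is independent.
        s = c if style_mode == "confounded" else rng.choice((-1.0, 1.0))
        x = [rng.gauss(0.0, NOISE) for _ in range(DIM)]
        x[0] += SIGNAL * c            # label information
        x[1] += STYLE * s             # style information
        labels.append(y)
        vectors.append(x)
    return labels, vectors


def fit_mean_difference_probe(labels, vectors):
    """Linear probe direction: mean(deceptive) minus mean(honest)."""
    pos = [x for y, x in zip(labels, vectors) if y == 1]
    neg = [x for y, x in zip(labels, vectors) if y == 0]
    return [
        sum(x[j] for x in pos) / len(pos) - sum(x[j] for x in neg) / len(neg)
        for j in range(DIM)
    ]


def probe_scores(direction, vectors):
    """Project each activation vector onto the probe direction."""
    return [sum(w * v for w, v in zip(direction, x)) for x in vectors]


def auroc(labels, scores):
    """Probability that a positive outscores a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate(direction, dataset):
    labels, vectors = dataset
    return auroc(labels, probe_scores(direction, vectors))


rng = random.Random(0)
train_clean = make_activations(1000, "confounded", rng)
train_augmented = make_activations(1000, "random", rng)
test_clean = make_activations(1000, "confounded", rng)
test_shifted = make_activations(1000, "random", rng)

plain_probe = fit_mean_difference_probe(*train_clean)
augmented_probe = fit_mean_difference_probe(*train_augmented)

clean_auc = evaluate(plain_probe, test_clean)
shifted_auc = evaluate(plain_probe, test_shifted)
augmented_auc = evaluate(augmented_probe, test_shifted)

print(f"plain probe, clean test:       {clean_auc:.3f}")
print(f"plain probe, shifted test:     {shifted_auc:.3f}")
print(f"augmented probe, shifted test: {augmented_auc:.3f}")
```

In this toy setup the plain probe leans on the style axis because style and label move together in its training data, so its AUROC drops once style is decoupled from the label. The augmented probe sees style vary independently and puts most of its weight on the label axis. Whether the same mechanism explains the Gemma 3 results is something only the paper's own diagnostics can establish.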

Key facts

Paper: arXiv 2605.27958
Tests Gemma 3 models from 1B to 27B parameters
Probes achieve AUROC ≥0.998 on clean data
Probes collapse under stylistic shift
Four hypotheses tested: single linear direction, multi-dimensional subspace, convex conic hull, entropy proxy
Uses cross-domain transfer matrices, permutation null baselines, entropy-residualization, distractor evaluations
Eight stylistic shifts evaluated
Style-augmented probes achieve mean AUROC 0.979-0.983 on unseen styles
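
One of the listed checks, the permutation null baseline, can be illustrated generically: shuffle the labels, recompute AUROC many times, and compare the observed value against that chance distribution. The sketch below is a standard label-permutation test on probe scores, not the paper's exact procedure, which this summary does not describe. The function name `permutation_null`, the number of permutations, and the synthetic scores are assumptions.

```python
import random


def auroc(labels, scores):
    """Probability that a positive outscores a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_null(labels, scores, n_perm=200, seed=0):
    """Return (observed AUROC, null AUROCs, one-sided p-value)."""
    rng = random.Random(seed)
    observed = auroc(labels, scores)
    shuffled = list(labels)           # copy so the caller's labels stay intact
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)         # break any link between label and score
        null.append(auroc(shuffled, scores))
    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (1 + sum(a >= observed for a in null)) / (n_perm + 1)
    return observed, null, p_value


# Synthetic probe scores: deceptive examples (label 1) score higher on average.
data_rng = random.Random(1)
labels = [i % 2 for i in range(200)]
scores = [data_rng.gauss(2.0 if y == 1 else 0.0, 1.0) for y in labels]

observed, null, p_value = permutation_null(labels, scores)
null_mean = sum(null) / len(null)
print(f"observed AUROC: {observed:.3f}")
print(f"null mean AUROC: {null_mean:.3f}")
print(f"p-value: {p_value:.4f}")
```

A probe whose AUROC sits inside the shuffled-label distribution is indistinguishable from chance, which is a useful sanity check when reported scores are as high as 0.998.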
