Probe-Based Deception Detection in LLMs Fails Under Stylistic Shift
A recent arXiv preprint (2605.27958) evaluates linear probes trained on LLM activations as deception detectors across the Gemma 3 model family (1B-27B parameters). The probes reach near-perfect AUROC (≥0.998) on clean benchmarks but collapse under stylistic distribution shift. The study tests four hypotheses about how deception is encoded: a single linear direction, a multi-dimensional subspace, a convex conic hull, and an entropy proxy. Using cross-domain transfer matrices, permutation null baselines, entropy-residualization, distractor evaluations, and multi-dimensional probe analysis across eight stylistic shifts, the authors diagnose why the probes fail. They also show that style-augmented probes recover strong detection on previously unseen styles (mean AUROC 0.979–0.983).
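To make the clean-versus-shifted gap concrete, here is a minimal toy sketch in Python. It is not the paper's code or data. The synthetic "activations", the mean-difference probe, and the model of style as a per-example nuisance direction that partly overlaps the label direction are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import random

def make_activations(n, dim, style_scale, rng):
    """Return toy activation vectors and labels (1 = deceptive, 0 = honest)."""
    xs, ys = [], []
    for i in range(n):
        y = i % 2
        x = [rng.gauss(0.0, 0.5) for _ in range(dim)]
        x[0] += 1.5 if y == 1 else -1.5   # assumed: label signal lives on axis 0
        s = rng.gauss(0.0, style_scale)   # assumed: per-example style strength
        x[0] += s / 2 ** 0.5              # style direction overlaps axis 0 ...
        x[1] += s / 2 ** 0.5              # ... and axis 1
        xs.append(x)
        ys.append(y)
    return xs, ys

def fit_mean_difference_probe(xs, ys):
    """Linear probe direction: mean(deceptive) minus mean(honest)."""
    dim = len(xs[0])
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return [
        sum(x[j] for x in pos) / len(pos) - sum(x[j] for x in neg) / len(neg)
        for j in range(dim)
    ]

def probe_scores(w, xs):
    """Project each activation vector onto the probe direction."""
    return [sum(wj * xj for wj, xj in zip(w, x)) for x in xs]

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

rng = random.Random(0)
train_x, train_y = make_activations(400, 8, 0.0, rng)  # clean training set
clean_x, clean_y = make_activations(400, 8, 0.0, rng)  # clean held-out set
shift_x, shift_y = make_activations(400, 8, 3.0, rng)  # style-shifted held-out set

w = fit_mean_difference_probe(train_x, train_y)
clean_auroc = auroc(probe_scores(w, clean_x), clean_y)
shifted_auroc = auroc(probe_scores(w, shift_x), shift_y)
print(f"clean AUROC:   {clean_auroc:.3f}")
print(f"shifted AUROC: {shifted_auroc:.3f}")
```

In this toy setup the probe separates the clean held-out set almost perfectly, while the style term adds variance along the probe direction and lowers AUROC on the shifted set. The real failure mechanism in Gemma 3 activations may differ; the sketch only shows why a high clean-benchmark AUROC does not by itself guarantee robustness.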
Key facts
- Paper: arXiv 2605.27958
- Tests Gemma 3 models from 1B to 27B parameters
- Probes achieve AUROC ≥0.998 on clean data
- Probes collapse under stylistic shift
- Four hypotheses tested: single linear direction, multi-dimensional subspace, convex conic hull, entropy proxy
- Uses cross-domain transfer matrices, permutation null baselines, entropy-residualization, distractor evaluations
- Eight stylistic shifts evaluated
- Style-augmented probes achieve mean AUROC 0.979-0.983 on unseen styles
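One of the listed checks, the permutation null baseline, can be sketched as follows. This is a hedged illustration of one common variant (shuffling held-out labels while keeping probe scores fixed), not necessarily the procedure used in the paper. The scores are synthetic, and the variable names are hypothetical.

```python
import random

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

rng = random.Random(1)

# Assumed toy data: held-out probe scores where deceptive examples (label 1)
# score higher on average than honest ones (label 0).
labels = [i % 2 for i in range(200)]
scores = [rng.gauss(2.0 if y == 1 else 0.0, 1.0) for y in labels]

observed = auroc(scores, labels)

# Null distribution: AUROC of the same scores against shuffled labels.
null_aurocs = []
for _ in range(200):
    shuffled = labels[:]
    rng.shuffle(shuffled)
    null_aurocs.append(auroc(scores, shuffled))

null_mean = sum(null_aurocs) / len(null_aurocs)
# One-sided permutation p-value with the usual +1 correction.
p_value = (1 + sum(a >= observed for a in null_aurocs)) / (1 + len(null_aurocs))

print(f"observed AUROC: {observed:.3f}")
print(f"null mean AUROC: {null_mean:.3f}")
print(f"permutation p-value: {p_value:.4f}")
```

The null AUROCs cluster around 0.5, so an observed AUROC far above that range indicates the probe score carries label information beyond chance. A variant that refits the probe on shuffled training labels is also common; which one the authors use is not stated in this summary.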
Entities
Institutions
- arXiv