ARTFEED — Contemporary Art Intelligence

Linear Probes May Generalize Better Using Persona Coordinates for AI Safety

ai-technology · 2026-05-12

A new arXiv preprint (2605.09391) investigates whether linear probes, a white-box monitoring method, can generalize better under distribution shift by operating in a low-dimensional subspace of model internals. The authors propose constructing persona axes for deception and sycophancy using contrastive persona prompts, inspired by the Assistant Axis and Persona Selection Model. Unsupervised PCA of persona-specific vectors yields first principal components that cleanly separate harmful behaviors, potentially improving robustness against strategic deception and sandbagging in language models. The study targets a known weakness: current probes fail under distribution shift, a failure that limits their real-world utility for monitoring harmful behaviors during model interactions.
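The persona-axis construction described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the activations are synthetic with a planted "deception" direction, and the variable names (`persona_axis`, `diffs`) are my own. The real pipeline would collect residual-stream activations from paired contrastive persona prompts (a deceptive persona vs. an honest counterpart), form per-prompt difference vectors, and take the first principal component as the axis.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, n_prompts = 64, 50

# Plant a ground-truth "deception" direction so we can check recovery.
true_axis = rng.normal(size=hidden_dim)
true_axis /= np.linalg.norm(true_axis)

# Fabricated activations for honest-persona prompts...
base = rng.normal(size=(n_prompts, hidden_dim))
# ...and for the matched deceptive-persona prompts: same content plus a
# variable-strength push along the planted axis, with a little noise.
mags = rng.uniform(1.0, 5.0, size=(n_prompts, 1))
deceptive = base + mags * true_axis + 0.1 * rng.normal(size=(n_prompts, hidden_dim))

# Contrastive persona vectors: per-prompt activation differences.
diffs = deceptive - base

# Unsupervised PCA: the first principal component of the centered
# difference vectors is taken as the persona axis.
centered = diffs - diffs.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
persona_axis = vt[0]

# The recovered axis should align with the planted direction (up to sign).
alignment = abs(persona_axis @ true_axis)
print(f"alignment with planted axis: {alignment:.3f}")
```

Because the per-prompt push varies in strength, most of the variance in the difference vectors lies along the planted direction, so PCA recovers it; with a constant push one would instead use the mean difference vector directly.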

Key facts

  • arXiv:2605.09391v1 is a new paper on linear probes for AI safety.
  • Text-only monitoring is insufficient due to strategic deception and sandbagging.
  • White-box monitors like linear probes can read model internals directly.
  • Current probes fail under distribution shift, limiting real-world use.
  • The study explores a low-dimensional subspace of model internals for robustly capturing harmful behaviors.
  • Persona axes for deception and sycophancy are constructed using contrastive persona prompts.
  • Unsupervised PCA of persona-specific vectors produces first principal components that separate harmful behaviors.
  • The approach is inspired by the Assistant Axis and Persona Selection Model.
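The intuition behind the key facts above — that a probe restricted to a persona coordinate can survive distribution shift that defeats a full-dimensional probe — can be illustrated with a toy sketch. Everything here is assumed for illustration: the persona axis is a fixed unit vector rather than one learned by the paper's PCA procedure, and the "probe" is just a threshold on the 1-D projection.

```python
import numpy as np

rng = np.random.default_rng(1)
hidden_dim = 64

# Assumed persona axis (in the paper this would come from PCA over
# contrastive persona vectors; here we fix an arbitrary unit vector).
axis = rng.normal(size=hidden_dim)
axis /= np.linalg.norm(axis)

def sample(n, harmful, shift=None):
    """Synthetic activations: harmful examples sit higher along the axis."""
    x = rng.normal(size=(n, hidden_dim))
    x += (2.0 if harmful else -2.0) * axis
    if shift is not None:
        x += shift  # distribution shift (e.g. topic or formatting change)
    return x

def probe(x):
    """1-D linear probe: threshold the projection onto the persona axis."""
    return x @ axis > 0.0

# In-distribution accuracy.
acc_in = np.mean(np.concatenate([probe(sample(200, True)),
                                 ~probe(sample(200, False))]))

# Shift every input along a direction orthogonal to the persona axis;
# the 1-D persona coordinate is unaffected by construction.
d = rng.normal(size=hidden_dim)
d -= (d @ axis) * axis
shift = 5.0 * d / np.linalg.norm(d)
acc_shifted = np.mean(np.concatenate([probe(sample(200, True, shift)),
                                      ~probe(sample(200, False, shift))]))
print(f"in-distribution: {acc_in:.3f}, shifted: {acc_shifted:.3f}")
```

A probe trained on all 64 coordinates could latch onto shift-sensitive directions; restricting attention to the persona coordinate discards them by construction, which is one plausible reading of why the low-dimensional subspace helps.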

Entities

Institutions

  • arXiv

Sources