Probing Persona-Dependent Preferences in Large Language Models
A recent study posted to arXiv (2605.13339) examines how large language models (LLMs) represent preferences across personas. The researchers trained linear probes on the residual-stream activations of Gemma-3-27B and Qwen-3.5-122B to predict pairwise task choices. They identified a preference vector that consistently tracks the model's choices across different prompts and contexts, and steering along this vector in Gemma-3-27B causally controls which option the model picks. Importantly, this preference representation is largely shared across personas: a probe trained on the helpful-assistant persona both predicts and steers the choices of very different personas. These results suggest that LLMs may rely on a single internal preference system despite surface-level behavioral variation.
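The following is a minimal sketch of the linear-probe idea described above, using synthetic stand-in data rather than real activations; the hidden-state width, layer choice, and dataset size are illustrative assumptions, not details from the paper.

```python
# Sketch: fit a linear probe on (stand-in) residual-stream activations to
# predict a binary pairwise choice, then read off a candidate preference direction.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

d_model = 512        # hypothetical residual-stream width
n_examples = 2000    # hypothetical number of pairwise-choice prompts

# One row per prompt; label 1 if the model picked option A, 0 if option B.
X = rng.normal(size=(n_examples, d_model))
true_direction = rng.normal(size=d_model)
y = (X @ true_direction + 0.5 * rng.normal(size=n_examples) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print(f"held-out accuracy: {probe.score(X_test, y_test):.3f}")

# The normalized weight vector is the candidate "preference vector".
preference_vector = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
```

In this setup, cross-persona transfer would amount to fitting the probe on activations gathered under one persona prompt and evaluating it on activations gathered under another.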
Key facts
- Study published on arXiv with ID 2605.13339
- Models used: Gemma-3-27B and Qwen-3.5-122B
- Linear probes trained on residual-stream activations
- Preference vector identified that tracks choices across prompts
- Steering along the preference vector causally controls pairwise choice on Gemma-3-27B (sketched after this list)
- Preference representation is shared across personas
- Probe trained on helpful assistant predicts choices of other personas
- Research explores internal implementation of persona-dependent preferences
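Below is a minimal sketch of activation steering along a preference direction using a PyTorch forward hook. The toy model, layer index, and scale `alpha` are placeholders; this only approximates the kind of intervention reported for Gemma-3-27B.

```python
# Sketch: add alpha * preference_vector to a layer's output via a forward hook,
# shifting the "residual stream" of a toy model along the preference direction.
import torch
import torch.nn as nn

d_model = 512
preference_vector = torch.randn(d_model)
preference_vector = preference_vector / preference_vector.norm()

# Toy stand-in for a transformer's stack of residual-stream layers.
model = nn.Sequential(*[nn.Linear(d_model, d_model) for _ in range(4)])

def make_steering_hook(direction: torch.Tensor, alpha: float):
    def hook(module, inputs, output):
        # Returning a value from a forward hook replaces the layer's output.
        return output + alpha * direction
    return hook

# Positive alpha pushes toward one option of the pair, negative toward the other.
layer_idx = 2
handle = model[layer_idx].register_forward_hook(
    make_steering_hook(preference_vector, alpha=4.0)
)

x = torch.randn(1, d_model)
steered_out = model(x)
handle.remove()
unsteered_out = model(x)
print("output shift norm:", (steered_out - unsteered_out).norm().item())
```

In the cross-persona setting described above, the same direction (fit on the helpful-assistant persona) would be added while the model is prompted with a different persona.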
Entities
Institutions
- arXiv