Persona Steering Vectors Reduce AI Sycophancy Without Accuracy Loss
A new study on arXiv (2605.21006) investigates whether off-the-shelf persona steering vectors can reduce sycophancy in language models, the tendency to agree with users even when they are wrong. Contrastive Activation Addition (CAA), the standard mitigation, requires labeled sycophancy data. The researchers instead tested persona vectors originally designed for role-playing, not sycophancy. Steering toward doubt or scrutiny personas achieved 68% and 98% of CAA's sycophancy reduction in two instruction-tuned models, while maintaining accuracy when users are correct. The effect is asymmetric: steering toward agreeable personas does not increase sycophancy. Geometrically, persona vectors are largely independent of the sycophancy direction in activation space. The findings suggest persona steering is a viable alternative to CAA that needs no labeled sycophancy data.
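For readers unfamiliar with activation steering, here is a minimal sketch of the mean-difference idea commonly associated with CAA: average the activations for two contrasting sets of examples, subtract, and add a scaled copy of the result to a hidden state. The function names, the toy 3-dimensional activations, and the `alpha` scale are illustrative assumptions, not the paper's code or data.

```python
from statistics import fmean


def mean_vector(rows):
    """Average a list of equal-length activation vectors, dimension by dimension."""
    return [fmean(col) for col in zip(*rows)]


def caa_vector(positive_acts, negative_acts):
    """Mean-difference steering vector: mean(positive) - mean(negative)."""
    pos = mean_vector(positive_acts)
    neg = mean_vector(negative_acts)
    return [p - n for p, n in zip(pos, neg)]


def steer(hidden, vector, alpha):
    """Add alpha * vector to one hidden state; a negative alpha steers away."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy activations, made up for illustration only (hypothetical values).
sycophantic = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
honest = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]

syco_direction = caa_vector(sycophantic, honest)
steered = steer([1.0, 1.0, 1.0], syco_direction, alpha=-0.5)
```

In a real model the vectors would come from a chosen layer's residual stream and be applied during generation. Persona steering, as described above, would swap in a vector derived from persona data rather than from labeled sycophantic and honest pairs.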
Key facts
- arXiv paper 2605.21006 studies sycophancy mitigation using persona steering vectors.
- Standard method CAA uses labeled sycophantic/honest response pairs.
- Off-the-shelf persona vectors were not trained on sycophancy data.
- Doubt and scrutiny personas achieve 68% and 98% of CAA's sycophancy-reduction effect.
- Persona steering maintains accuracy when the user is correct, unlike CAA.
- Agreeable personas do not produce a mirror-image increase in sycophancy.
- Persona vectors are largely geometrically independent of the sycophancy direction.
- Study tested two instruction-tuned models.
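One common way to check whether two directions in activation space are independent is cosine similarity, where a value near zero means the vectors are close to orthogonal. The sketch below assumes that measure; the summary does not state which metric the study used, and the example vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, not taken from the paper.
persona_vec = [0.0, 3.0, 4.0]
syco_vec = [2.0, 0.0, 0.0]

# A value near 0 indicates the two directions are nearly orthogonal.
sim = cosine_similarity(persona_vec, syco_vec)
```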
Entities
Institutions
- arXiv