PSR Models Outperform Existing Activation Steering Methods
A new framework, Prompt Steering Replacement (PSR), formulates prompt steering as a form of activation steering and trains models to imitate prompt-based interventions. PSR models estimate token-specific steering coefficients from activations and, in experiments on three steering benchmarks across multiple language models, outperform existing activation steering methods.
Key facts
- arXiv:2605.03907v1
- PSR models estimate token-specific steering coefficients from activations
- PSR models are trained to imitate prompt-based interventions
- Experiments cover three steering benchmarks across multiple language models
- PSR models outperform existing activation steering methods
- Popular activation steering methods are not faithful to prompt steering mechanics
- Prompt steering applies strong interventions on some tokens while barely affecting others
- Framework formulates prompt steering as a form of activation steering
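
To make the contrast between uniform and token-specific steering concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the summary does not describe the PSR architecture or training objective, so the linear-probe-plus-sigmoid coefficient estimator, the function names, and every parameter below are assumptions chosen for illustration, and the weights are random rather than trained to imitate a prompt.

```python
import numpy as np


def uniform_steer(hidden, direction, strength):
    """Conventional activation steering: add the same scaled vector at every token."""
    return hidden + strength * direction


def token_specific_steer(hidden, direction, weights, bias, max_strength):
    """Hypothetical token-wise steering: one coefficient per token, read from its activation.

    The linear probe and sigmoid are illustrative assumptions, not the PSR design.
    """
    # One scalar logit per token, computed from that token's activation.
    logits = hidden @ weights + bias
    # Squash into (0, max_strength) so some tokens get strong steering, others almost none.
    coeffs = max_strength / (1.0 + np.exp(-logits))
    # Broadcast each token's coefficient across the hidden dimension.
    return hidden + coeffs[:, None] * direction, coeffs


rng = np.random.default_rng(0)
num_tokens, hidden_dim = 5, 8

hidden = rng.normal(size=(num_tokens, hidden_dim))   # stand-in for layer activations
direction = rng.normal(size=hidden_dim)
direction /= np.linalg.norm(direction)               # unit-norm steering direction
weights = rng.normal(size=hidden_dim)                # untrained probe weights (assumption)

uniform = uniform_steer(hidden, direction, strength=2.0)
steered, coeffs = token_specific_steer(
    hidden, direction, weights, bias=0.0, max_strength=4.0
)

# Because the direction has unit norm, each token's shift equals its coefficient.
uniform_shift = np.linalg.norm(uniform - hidden, axis=1)
token_shift = np.linalg.norm(steered - hidden, axis=1)
print("uniform shift per token:       ", np.round(uniform_shift, 3))
print("token-specific shift per token:", np.round(token_shift, 3))
```

In the uniform case every token moves by the same amount, whereas the token-specific variant moves each token by its own coefficient. That per-token variation is the behavior the key facts attribute to prompt steering; a trained estimator would presumably learn the coefficients from prompt-steered activations rather than from random weights.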