Persona Vectors Form Early in LLM Pretraining
A new study posted on arXiv traces the formation of persona vectors, linear directions in internal activations that correspond to high-level behaviors such as sycophancy, across the pretraining of OLMo-3-7B. These vectors form within the first 0.22% of pretraining and remain effective for steering fully post-trained instruct models. Although the core representations emerge early, they continue to be refined both geometrically and semantically throughout training. The research addresses a gap in AI safety interpretability, as persona vectors are routinely used to inspect and steer model behavior.
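The core idea, a linear direction in activation space that can be added to hidden states to steer behavior, can be sketched with a common difference-of-means construction. This is an illustrative assumption, not necessarily the paper's exact method: the function names, the synthetic activation data, and the steering coefficient `alpha` are all hypothetical.

```python
import numpy as np

def persona_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means direction between activations collected on
    trait-eliciting prompts (pos) and neutral prompts (neg), normalized."""
    v = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(hidden: np.ndarray, v: np.ndarray, alpha: float = 4.0) -> np.ndarray:
    """Add the scaled persona direction to every token's hidden state."""
    return hidden + alpha * v

# Synthetic stand-ins for layer activations (seq_len x hidden_dim)
rng = np.random.default_rng(0)
d = 64
pos = rng.normal(size=(32, d)) + 0.5  # activations under the target trait
neg = rng.normal(size=(32, d)) - 0.5  # activations without the trait
v = persona_vector(pos, neg)

h = rng.normal(size=(10, d))          # hidden states for one forward pass
h_steered = steer(h, v)               # same shape, shifted along the vector
```

In practice the activations would come from a specific transformer layer of a model such as OLMo-3-7B, and `alpha` would be tuned per layer and trait.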
Key facts
- Persona vectors form within the first 0.22% of OLMo-3 pretraining.
- Vectors remain effective for steering fully post-trained instruct models.
- Core representations continue to be refined geometrically and semantically throughout training.
- Study addresses interpretability gap in AI safety.
- Persona vectors correspond to traits like evil or sycophancy.
- Research uses OLMo-3-7B model.
- Findings published on arXiv.
- Vectors are linear directions in internal activations.