Persona Vectors Form Early in LLM Pretraining
A new study posted on arXiv traces the formation of persona vectors, linear directions in internal activations that correspond to high-level behaviors such as sycophancy, across the pretraining of OLMo-3-7B. These vectors form within the first 0.22% of pretraining and remain effective for steering fully post-trained instruct models. Although the core representations emerge early, they continue to be refined both geometrically and semantically throughout training. The research addresses a gap in AI safety interpretability, as persona vectors are routinely used to inspect and steer model behavior.
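The core idea, a linear direction in activation space that can be added to hidden states to steer behavior, can be sketched with a common difference-of-means construction. This is an illustrative assumption, not necessarily the paper's exact method: the function names, the synthetic activation data, and the steering coefficient `alpha` are all hypothetical.

```python
import numpy as np

def persona_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means direction between activations collected on
    trait-eliciting prompts (pos) and neutral prompts (neg), normalized."""
    v = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(hidden: np.ndarray, v: np.ndarray, alpha: float = 4.0) -> np.ndarray:
    """Add the scaled persona direction to every token's hidden state."""
    return hidden + alpha * v

# Synthetic stand-ins for layer activations (seq_len x hidden_dim)
rng = np.random.default_rng(0)
d = 64
pos = rng.normal(size=(32, d)) + 0.5  # activations under the target trait
neg = rng.normal(size=(32, d)) - 0.5  # activations without the trait
v = persona_vector(pos, neg)

h = rng.normal(size=(10, d))          # hidden states for one forward pass
h_steered = steer(h, v)               # same shape, shifted along the vector
```

In practice the activations would come from a specific transformer layer of a model such as OLMo-3-7B, and `alpha` would be tuned per layer and trait.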
Key facts
- Persona vectors form within the first 0.22% of OLMo-3 pretraining.
- Vectors remain effective for steering fully post-trained instruct models.
- Core representations continue to be refined geometrically and semantically throughout training.
- Study addresses interpretability gap in AI safety.
- Persona vectors correspond to traits like evil or sycophancy.
- Research uses OLMo-3-7B model.
- Findings published on arXiv.
- Vectors are linear directions in internal activations.