Anchor-Projected Representations Enable Cross-Model Behavioral Axis Transfer
A new framework, anchor projection, transfers behavioral directions across large language model families without fine-tuning. The method maps hidden representations into a shared anchor coordinate space (ACS), where per-model directions are averaged into a single canonical direction that can be reconstructed in a new model from its anchor activations alone. Evaluated on five instruction-tuned families (Llama, Qwen, Mistral, Phi, and others) across ten behavioral axes, the approach shows tight same-axis alignment within the LQMP cluster (Llama, Qwen, Mistral, Phi), achieving 0.83 ten-way detection accuracy on held-out targets. The paper is available on arXiv as preprint 2605.09875.
Key facts
- Anchor-projection framework maps hidden representations into a shared anchor coordinate space (ACS).
- Behavioral directions extracted from source models are projected into ACS and averaged into a single canonical direction per axis.
- For a new model, the canonical direction is reconstructed using only anchor activations, without fine-tuning.
- Evaluated on five instruction-tuned model families: Llama, Qwen, Mistral, Phi, and others.
- Ten behavioral axes were tested.
- Same-axis directions align tightly across the LQMP cluster (Llama, Qwen, Mistral, Phi) in ACS.
- Within the aligned LQMP cluster, reconstructed directions achieve 0.83 ten-way detection accuracy on held-out target models.
- The paper is available on arXiv with ID 2605.09875.
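The pipeline in the key facts above — project source directions into the anchor space, average them into a canonical direction, then reconstruct in a new model from its anchor activations alone — can be sketched numerically. This is an illustrative guess at the mechanics, not the paper's implementation: the projection here uses cosine similarity to anchor activations and the reconstruction uses least squares, and all model names, dimensions, and data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def to_acs(direction, anchors):
    # ACS coordinates of a model-space direction: its similarity to each
    # anchor activation. (Illustrative choice; the paper's exact projection
    # may differ.)
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    d = direction / np.linalg.norm(direction)
    return a @ d  # shape: (n_anchors,)

def from_acs(coords, anchors):
    # Reconstruct a model-space direction from ACS coordinates using only
    # that model's anchor activations, via least squares -- no fine-tuning.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    v, *_ = np.linalg.lstsq(a, coords, rcond=None)
    return v / np.linalg.norm(v)

# Toy setup: two "source" models with different hidden sizes and one
# "target" model; anchors stand in for activations on shared anchor prompts.
n_anchors = 64
anchors = {m: rng.normal(size=(n_anchors, d))
           for m, d in [("llama", 128), ("qwen", 96), ("target", 112)]}

# Hypothetical per-model behavioral directions for one axis.
dirs = {m: rng.normal(size=anchors[m].shape[1]) for m in ("llama", "qwen")}

# Project each source direction into ACS and average into a canonical direction.
canonical = np.mean([to_acs(dirs[m], anchors[m]) for m in dirs], axis=0)

# Reconstruct the canonical direction in the target model's own space.
target_dir = from_acs(canonical, anchors["target"])
print(target_dir.shape)  # (112,)
```

Note that the reconstruction only ever touches the target model's anchor activations, which is what makes the transfer possible without gradients or fine-tuning on the target.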
Entities
Institutions
- arXiv