Pref-CTRL: Preference-Driven LLM Alignment via Representation Editing
Researchers propose Pref-CTRL, a novel test-time alignment method for large language models that uses a multi-objective value function, trained on preference data, to edit internal representations during inference. Unlike the prior method RE-Control, which relies on a single value function, Pref-CTRL better captures the pairwise structure of human preferences between candidate responses. The method outperforms RE-Control on two benchmark datasets and generalizes better to out-of-domain data. The source code is publicly available.
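The summary states that the value function is trained on preference data, i.e. pairs of chosen and rejected responses. A standard way to fit a value head to such pairs is a Bradley-Terry style pairwise loss; the sketch below is a minimal, hypothetical illustration with a linear value head on synthetic hidden states, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "hidden states" for chosen/rejected responses (dim 8). In the
# paper's setting these would come from the LLM; here they are synthetic,
# with chosen states shifted along a fixed preference direction.
dim, n = 8, 256
direction = rng.normal(size=dim)
rejected = rng.normal(size=(n, dim))
chosen = rejected + 0.5 * direction + 0.2 * rng.normal(size=(n, dim))

w = np.zeros(dim)  # linear value head: v(h) = w @ h
lr = 0.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Bradley-Terry pairwise loss: -log sigmoid(v(chosen) - v(rejected)),
# minimized by gradient descent on w.
for _ in range(200):
    margin = (chosen - rejected) @ w
    p = sigmoid(margin)
    grad = -(((1.0 - p)[:, None]) * (chosen - rejected)).mean(axis=0)
    w -= lr * grad

# Fraction of pairs where the trained head ranks chosen above rejected.
acc = float(((chosen - rejected) @ w > 0).mean())
print(f"pairwise accuracy: {acc:.2f}")
```

This is what "capturing the pairwise structure of preferences" amounts to: the head is trained only on response comparisons, never on absolute reward labels.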
Key facts
- Pref-CTRL is a test-time alignment method for LLMs.
- It uses a multi-objective value function trained on preference data.
- It edits internal representations during inference.
- It outperforms RE-Control on two benchmark datasets.
- It shows greater generalization on out-of-domain datasets.
- The source code is available.
- The paper is on arXiv with ID 2604.23543.
- RE-Control uses a single value function and gradient-based editing.
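The facts above describe gradient-based editing of internal representations guided by a value function. A minimal sketch of that idea, under assumed details (linear value heads, a fixed preference-weight vector, plain gradient ascent on a hidden state), might look like this; none of the names or hyperparameters below come from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 8

# Hypothetical linear value heads for two objectives (e.g. helpfulness,
# harmlessness); in Pref-CTRL these would be learned from preference data.
W = rng.normal(size=(2, dim))   # one row of weights per objective
prefs = np.array([0.7, 0.3])    # assumed user-chosen objective weights

def value(h):
    """Scalarized multi-objective value of a hidden state h."""
    return float(prefs @ (W @ h))

def edit(h, alpha=0.1, steps=5):
    """Gradient-ascent edit of a hidden state toward higher value.

    For a linear value head the gradient w.r.t. h is constant (prefs @ W),
    so each step moves h a fixed amount along that direction.
    """
    grad = prefs @ W
    for _ in range(steps):
        h = h + alpha * grad
    return h

h = rng.normal(size=dim)        # stand-in for one decoding step's hidden state
h_edited = edit(h)
print(value(h), value(h_edited))
```

At inference time such an edit would be applied to the model's hidden state at each decoding step before the next token is sampled, steering generation without updating any model weights.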