ActiveDPO: Sample-Efficient LLM Alignment via Active Data Selection
ActiveDPO is a new algorithm for aligning large language models (LLMs) with human preferences through active data selection. It addresses the high cost of collecting preference annotations by selecting only the most informative data points for labeling. Unlike prior methods, ActiveDPO does not rely on restrictive assumptions such as a linear reward function; instead, it uses the LLM itself to parameterize the reward model that drives data selection. The approach is theoretically grounded and aims to improve sample efficiency on alignment tasks such as question answering, mathematical reasoning, and code generation. The paper is available on arXiv under ID 2505.19241.
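To make the idea of an LLM-parameterized reward concrete, here is a minimal sketch. It assumes the standard DPO implicit reward, beta * (log pi(y|x) - log pi_ref(y|x)), and pairs it with a generic uncertainty heuristic: query the pairs whose predicted preference is closest to a coin flip. The heuristic is an illustrative stand-in, not the selection criterion from the ActiveDPO paper, and every name here (implicit_reward, select_most_uncertain, the dictionary keys, the beta value) is hypothetical.

```python
import math


def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """Standard DPO implicit reward: beta * (log pi(y|x) - log pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def preference_prob(margin):
    """Bradley-Terry probability that response A is preferred over B."""
    return 1.0 / (1.0 + math.exp(-margin))


def select_most_uncertain(candidates, k, beta=0.1):
    """Return the ids of the k pairs with the least certain predicted preference.

    ASSUMPTION: this is a generic uncertainty heuristic used for illustration,
    not the acquisition rule defined in the ActiveDPO paper.
    Each candidate holds sequence log-probabilities of responses A and B
    under the current policy and under the frozen reference model.
    """
    scored = []
    for c in candidates:
        reward_a = implicit_reward(c["logp_policy_a"], c["logp_ref_a"], beta)
        reward_b = implicit_reward(c["logp_policy_b"], c["logp_ref_b"], beta)
        p = preference_prob(reward_a - reward_b)
        # 1.0 when p == 0.5 (maximally uncertain), 0.0 when p is 0 or 1.
        uncertainty = 1.0 - abs(2.0 * p - 1.0)
        scored.append((uncertainty, c["id"]))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pair_id for _, pair_id in scored[:k]]


# Toy log-probabilities, made up purely for illustration.
candidates = [
    {"id": "a", "logp_policy_a": -10.0, "logp_ref_a": -10.0,
     "logp_policy_b": -10.0, "logp_ref_b": -10.0},
    {"id": "b", "logp_policy_a": -5.0, "logp_ref_a": -15.0,
     "logp_policy_b": -15.0, "logp_ref_b": -5.0},
    {"id": "c", "logp_policy_a": -9.0, "logp_ref_a": -10.0,
     "logp_policy_b": -10.0, "logp_ref_b": -10.0},
]
selected = select_most_uncertain(candidates, k=2)
```

In this toy setup the model already strongly prefers one response in pair "b", so a human label there would add little; pairs "a" and "c" are the ones sent for annotation. In practice the log-probabilities would come from scoring each response with the policy and reference LLMs.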
Key facts
- ActiveDPO is an algorithm for sample-efficient LLM alignment.
- It uses active data selection to reduce human preference annotation costs.
- The method does not assume a linear reward function, so it accommodates non-linear rewards.
- The LLM itself parameterizes the reward model for data selection.
- The approach is theoretically grounded.
- It targets downstream tasks like question answering, math reasoning, and code generation.
- The paper is on arXiv (2505.19241).
- Existing methods often rely on restrictive assumptions about the reward function, such as linearity.