ξ-DPO: A New Preference Optimization Method for LLMs
Researchers propose ξ-DPO (Direct Preference Optimization via Ratio Reward Margin) to address hyperparameter tuning challenges in reference-free preference optimization for large language models. The method targets a weakness of SimPO (Simple Preference Optimization): jointly tuning its two hyperparameters, β and γ, is difficult because the margin formulation does not carry a consistent interpretation across datasets. The authors' analysis shows that β implicitly controls sample filtering, while γ's effect depends on the dataset's reward-gap structure. ξ-DPO reformulates the preference objective through an equivalent transformation to improve both interpretability and performance. The paper is available on arXiv under ID 2605.10981.
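For context, the summary above does not restate the SimPO objective that defines these hyperparameters; the standard form from the SimPO paper (Meng et al., 2024) is:

```latex
% SimPO loss for a prompt x with chosen response y_w and rejected y_l:
% beta scales the length-normalized implicit reward, and gamma is an
% additive target margin inside the sigmoid.
\mathcal{L}_{\mathrm{SimPO}}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\,
  \log \sigma\!\left(
    \frac{\beta}{|y_w|}\log \pi_\theta(y_w \mid x)
    \;-\; \frac{\beta}{|y_l|}\log \pi_\theta(y_l \mid x)
    \;-\; \gamma
  \right)
```

Because γ enters additively inside the sigmoid, its useful range shifts with the typical length-normalized reward gap of each dataset, which is why it must be retuned jointly with β. The name "ratio reward margin" suggests ξ-DPO replaces this additive margin with a ratio-style term, but the exact reformulation should be taken from the paper itself.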
Key facts
- ξ-DPO is a new preference optimization method.
- It addresses hyperparameter tuning challenges in SimPO.
- SimPO removes the explicit reference model for efficiency.
- β in SimPO implicitly controls sample filtering (see the sketch after this list).
- γ's effect depends on the dataset's reward-gap structure.
- ξ-DPO uses ratio reward margin for reformulation.
- The paper is on arXiv with ID 2605.10981.
- The method aims to improve interpretability across datasets.
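To make the β-as-filter observation concrete, here is a minimal, hypothetical sketch: the loss follows the standard SimPO form above, while the pair values and helper name are illustrative, not from the paper. The per-pair gradient weight of the logistic loss is σ(−margin), so as β grows, pairs whose normalized reward gap already clears γ receive a vanishing gradient and are effectively filtered out:

```python
import torch
import torch.nn.functional as F

GAMMA = 0.5  # illustrative target margin

def simpo_loss(logp_w, logp_l, len_w, len_l, beta, gamma=GAMMA):
    """Reference-free SimPO loss on length-normalized log-probabilities.

    logp_w / logp_l: summed token log-probs of the chosen / rejected
    response under the policy; len_w / len_l: response token counts.
    """
    margin = beta * (logp_w / len_w - logp_l / len_l) - gamma
    return -F.logsigmoid(margin), margin

# Two hypothetical preference pairs: one with a large normalized reward
# gap (easy) and one nearly tied (hard).
logp_w = torch.tensor([-40.0, -55.0])
logp_l = torch.tensor([-60.0, -56.0])
len_w = torch.tensor([20.0, 20.0])
len_l = torch.tensor([20.0, 20.0])

for beta in (0.5, 2.0, 10.0):
    loss, margin = simpo_loss(logp_w, logp_l, len_w, len_l, beta)
    # d/dmargin of -logsigmoid(margin) is -sigmoid(-margin), so
    # sigmoid(-margin) is each pair's effective gradient weight.
    weight = torch.sigmoid(-margin)
    print(f"beta={beta}: gradient weights = {weight.tolist()}")
```

In this toy run, raising β from 0.5 to 10 drives the easy pair's gradient weight toward zero while keeping the near-tied pair active, matching the claim that β implicitly selects which samples still contribute to training.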