ξ-DPO: A New Preference Optimization Method for LLMs
Researchers propose ξ-DPO (Direct Preference Optimization via Ratio Reward Margin) to address hyperparameter tuning challenges in reference-free preference optimization for large language models. The method targets a weakness of SimPO (Simple Preference Optimization): jointly tuning its two hyperparameters, β and γ, is difficult because the margin formulation does not carry a consistent interpretation across datasets. The authors' analysis shows that β implicitly controls sample filtering, while γ's effect depends on the dataset's reward-gap structure. ξ-DPO reformulates the preference objective through an equivalent transformation to improve both interpretability and performance. The paper is available on arXiv under ID 2605.10981.
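For context, the summary above does not restate the SimPO objective that defines these hyperparameters; the standard form from the SimPO paper (Meng et al., 2024) is:

```latex
% SimPO loss for a prompt x with chosen response y_w and rejected y_l:
% beta scales the length-normalized implicit reward, and gamma is an
% additive target margin inside the sigmoid.
\mathcal{L}_{\mathrm{SimPO}}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\,
  \log \sigma\!\left(
    \frac{\beta}{|y_w|}\log \pi_\theta(y_w \mid x)
    \;-\; \frac{\beta}{|y_l|}\log \pi_\theta(y_l \mid x)
    \;-\; \gamma
  \right)
```

Because γ enters additively inside the sigmoid, its useful range shifts with the typical length-normalized reward gap of each dataset, which is why it must be retuned jointly with β. The name "ratio reward margin" suggests ξ-DPO replaces this additive margin with a ratio-style term, but the exact reformulation should be taken from the paper itself.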
Key facts
- ξ-DPO is a new preference optimization method.
- It addresses hyperparameter tuning challenges in SimPO.
- SimPO removes the explicit reference model for efficiency.
- β in SimPO implicitly controls sample filtering (see the sketch after this list).
- γ's effect depends on the dataset's reward-gap structure.
- ξ-DPO uses ratio reward margin for reformulation.
- The paper is on arXiv with ID 2605.10981.
- The method aims to improve interpretability across datasets.
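To make the β-as-filter observation concrete, here is a minimal, hypothetical sketch: the loss follows the standard SimPO form above, while the pair values and helper name are illustrative, not from the paper. The per-pair gradient weight of the logistic loss is σ(−margin), so as β grows, pairs whose normalized reward gap already clears γ receive a vanishing gradient and are effectively filtered out:

```python
import torch
import torch.nn.functional as F

GAMMA = 0.5  # illustrative target margin

def simpo_loss(logp_w, logp_l, len_w, len_l, beta, gamma=GAMMA):
    """Reference-free SimPO loss on length-normalized log-probabilities.

    logp_w / logp_l: summed token log-probs of the chosen / rejected
    response under the policy; len_w / len_l: response token counts.
    """
    margin = beta * (logp_w / len_w - logp_l / len_l) - gamma
    return -F.logsigmoid(margin), margin

# Two hypothetical preference pairs: one with a large normalized reward
# gap (easy) and one nearly tied (hard).
logp_w = torch.tensor([-40.0, -55.0])
logp_l = torch.tensor([-60.0, -56.0])
len_w = torch.tensor([20.0, 20.0])
len_l = torch.tensor([20.0, 20.0])

for beta in (0.5, 2.0, 10.0):
    loss, margin = simpo_loss(logp_w, logp_l, len_w, len_l, beta)
    # d/dmargin of -logsigmoid(margin) is -sigmoid(-margin), so
    # sigmoid(-margin) is each pair's effective gradient weight.
    weight = torch.sigmoid(-margin)
    print(f"beta={beta}: gradient weights = {weight.tolist()}")
```

In this toy run, raising β from 0.5 to 10 drives the easy pair's gradient weight toward zero while keeping the near-tied pair active, matching the claim that β implicitly selects which samples still contribute to training.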