ARTFEED — Contemporary Art Intelligence

ξ-DPO: A New Preference Optimization Method for LLMs

ai-technology · 2026-05-13

Researchers propose ξ-DPO (Direct Preference Optimization via Ratio Reward Margin) to address hyperparameter tuning challenges in reference-free preference optimization for large language models. The method targets a weakness of SimPO (Simple Preference Optimization): jointly tuning β and γ is difficult because the margin formulation is not interpretable across datasets. The authors' analysis shows that β implicitly controls sample filtering, while γ's effect depends on the dataset's reward gap structure. ξ-DPO reformulates the preference objective via an equivalent transformation to improve interpretability and performance. The paper is available on arXiv under ID 2605.10981.
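To make the β/γ interaction concrete, here is a minimal sketch of the SimPO objective that ξ-DPO builds on: the implicit reward is the length-normalized log-likelihood scaled by β, and γ is a fixed target margin. The loss form below follows the published SimPO formulation; the specific ratio-based reformulation in ξ-DPO is only described at a high level in this summary, so it is not reproduced here. Function names and example values are illustrative.

```python
import math

def simpo_loss(logp_w, logp_l, len_w, len_l, beta=2.0, gamma=0.5):
    """SimPO loss for one preference pair (no reference model).

    logp_w / logp_l: total log-likelihood of the chosen / rejected response
    len_w / len_l:   response lengths (for length normalization)
    """
    r_w = beta * logp_w / len_w  # implicit reward of the chosen response
    r_l = beta * logp_l / len_l  # implicit reward of the rejected response
    z = r_w - r_l - gamma        # reward gap minus the target margin
    # -log sigmoid(z): near zero once the gap exceeds gamma, large otherwise.
    # Note how beta rescales the gap r_w - r_l while gamma is fixed: which
    # pairs still produce gradient (i.e., get "filtered") depends on both,
    # which is the coupling the summary above describes.
    return math.log(1.0 + math.exp(-z))

# Chosen response is clearly preferred: gap (2.0) exceeds gamma, small loss.
loss = simpo_loss(logp_w=-10.0, logp_l=-20.0, len_w=10, len_l=10)
```

With these example values the reward gap is 2.0 against a margin of 0.5, so the loss is small (≈0.20); shrinking β shrinks the gap relative to the fixed γ, and the same pair contributes a much larger loss.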

Key facts

  • ξ-DPO is a new preference optimization method.
  • It addresses hyperparameter tuning challenges in SimPO.
  • SimPO eliminates the explicit reference model for efficiency.
  • β in SimPO controls sample filtering implicitly.
  • γ's effect depends on dataset reward gap structure.
  • ξ-DPO reformulates the objective using a ratio reward margin.
  • The paper is on arXiv with ID 2605.10981.
  • The method aims to improve interpretability across datasets.

Entities

Institutions

  • arXiv
