ARTFEED — Contemporary Art Intelligence

GraphDPO: Optimizing Language Models Over Preference Graphs

ai-technology · 2026-05-11

Researchers have introduced Graph Direct Preference Optimization (GraphDPO), a method that extends Direct Preference Optimization (DPO). Whereas DPO aligns language models from pairwise preference comparisons, GraphDPO operates over directed acyclic preference graphs built from rollout rankings, capturing the richer preference structure that multiple rollouts per prompt induce. Collapsing such data into independent pairs discards transitivity and yields redundant or conflicting supervision; GraphDPO instead encodes dominance relations as graph edges and optimizes a Plackett-Luce-inspired objective over graph neighborhoods, preserving transitivity and recovering standard DPO as a special case. The authors also report that the approach mitigates the optimization instability that arises when multi-rollout data is split into separate pairs. The paper is available on arXiv under identifier 2605.08037.
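The paper's exact graph construction isn't reproduced here, but the idea of turning a rollout ranking into a dominance DAG can be sketched in a few lines. In this hypothetical sketch, `ranking_to_dag` is an illustrative helper (not from the paper): it keeps only edges between consecutively ranked rollouts, so transitivity is carried by the graph's structure rather than by redundant pairwise edges.

```python
# Hypothetical sketch: build a preference DAG from a best-to-worst
# ranking of rollouts. An edge u -> v means "u dominates v".

def ranking_to_dag(ranking):
    """Return an adjacency dict for the transitive reduction of a ranking.

    Only consecutive pairs become edges, so u > w is implied by the
    path u -> v -> w instead of a redundant direct edge.
    """
    edges = {r: [] for r in ranking}
    for better, worse in zip(ranking, ranking[1:]):
        edges[better].append(worse)
    return edges

# Three rollouts for one prompt, ranked r2 > r0 > r1.
dag = ranking_to_dag(["r2", "r0", "r1"])
```

Merging DAGs from several rankings of overlapping rollout sets would give the general preference graph the paper describes; the sketch above covers only the single-ranking case.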

Key facts

  • GraphDPO generalizes DPO to operate over directed acyclic preference graphs.
  • DPO aligns language models using pairwise preference comparisons.
  • Multiple rollouts per prompt induce rich preference structure that pairwise DPO fails to exploit.
  • Collapsing multi-rollout data into independent pairs discards transitivity and introduces redundant supervision.
  • GraphDPO encodes dominance relations as edges in a preference graph.
  • The objective is a graph-structured Plackett-Luce-inspired function.
  • GraphDPO aggregates supervision over graph neighborhoods.
  • Standard DPO is a special case of GraphDPO.
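The facts above can be made concrete with a toy loss. This is a minimal sketch of what a graph-structured Plackett-Luce-inspired objective *could* look like, not the paper's actual formula: each node is scored against itself plus its out-neighborhood with a softmax over DPO-style implicit rewards `beta * (log pi - log pi_ref)`. The function name, the neighborhood definition, and the default `beta` are all assumptions for illustration.

```python
import math

def graph_pl_loss(policy_logp, ref_logp, edges, beta=0.1):
    """Plackett-Luce-style negative log-likelihood over a preference DAG.

    For each node u with dominated out-neighbors N(u), compute a softmax
    over implicit rewards s(x) = beta * (policy_logp[x] - ref_logp[x])
    restricted to {u} ∪ N(u), and penalize the log-probability of u
    winning. With a single edge u -> v this reduces to the standard DPO
    loss -log(sigmoid(s(u) - s(v))).
    """
    def s(x):
        return beta * (policy_logp[x] - ref_logp[x])

    loss = 0.0
    for u, dominated in edges.items():
        if not dominated:
            continue  # sinks of the DAG contribute no term
        log_denom = math.log(sum(math.exp(s(x)) for x in [u] + dominated))
        loss += -(s(u) - log_denom)
    return loss
```

With `edges = {"a": ["b"], "b": []}` the single term is `-log(e^{s_a} / (e^{s_a} + e^{s_b})) = -log(sigmoid(s_a - s_b))`, which is exactly the pairwise DPO loss, illustrating the "DPO as a special case" claim.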

Entities

Institutions

  • arXiv

Sources