DPO and RLHF Equivalence Is Conditional, Not Universal

other · 2026-05-22

A recent arXiv study shows that Direct Preference Optimization (DPO) is not always equivalent to Reinforcement Learning from Human Feedback (RLHF). The equivalence rests on an implicit assumption: that the optimal policy under RLHF favors the responses humans prefer. When that assumption fails, DPO optimizes a policy's advantage relative to the reference policy rather than absolute alignment with human preferences. The result can be pathological convergence, in which a policy lowers its DPO loss while still favoring the dispreferred response. The authors characterize the conditions under which the assumption fails, describe the resulting space of undesirable solutions, and show that DPO and RLHF pursue fundamentally different objectives in those cases. As a remedy, they propose Constrained Preference Optimization (CPO), which augments RLHF with alignment constraints.
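The relative-versus-absolute distinction can be illustrated with a toy calculation. The sketch below is not the paper's construction. It assumes the standard per-pair DPO loss, -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l))]), and uses hypothetical probabilities for a two-response problem with beta = 1. The function name `dpo_loss` and all numbers are illustrative assumptions.

```python
import math


def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=1.0):
    """Standard per-pair DPO loss for a preferred (w) and dispreferred (l) response.

    All arguments are probabilities; beta = 1.0 is an assumed value.
    """
    # Margin: how much the policy raised y_w relative to the reference,
    # minus how much it raised y_l relative to the reference.
    margin = beta * (
        (math.log(pi_w) - math.log(ref_w)) - (math.log(pi_l) - math.log(ref_l))
    )
    # -log(sigmoid(margin))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


# Hypothetical numbers: the reference policy strongly favors the
# dispreferred response y_l over the preferred response y_w.
ref_w, ref_l = 0.2, 0.8

# A candidate policy that shifts mass toward y_w but still ranks y_l first.
pi_w, pi_l = 0.4, 0.6

loss_at_ref = dpo_loss(ref_w, ref_l, ref_w, ref_l)  # margin is 0 here
loss_at_pi = dpo_loss(pi_w, pi_l, ref_w, ref_l)

loss_decreased = loss_at_pi < loss_at_ref
still_prefers_dispreferred = pi_l > pi_w

print(f"loss at reference: {loss_at_ref:.4f}")
print(f"loss at policy:    {loss_at_pi:.4f}")
print(f"loss decreased: {loss_decreased}, "
      f"still prefers dispreferred: {still_prefers_dispreferred}")
```

In this toy setting the loss drops because the policy improved relative to the reference, even though in absolute terms it still assigns more probability to the dispreferred response. That is the kind of behavior the summary calls pathological convergence.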

Key facts

DPO and RLHF equivalence is conditional, not universal.
Equivalence depends on an implicit assumption frequently violated in practice.
When assumption fails, DPO optimizes relative advantage over reference policy.
Pathological convergence: policies decrease DPO loss while preferring dispreferred responses.
Authors characterize when assumption is violated.
Undesirable solution space exists.
DPO and RLHF optimize fundamentally different objectives in such cases.
Constrained Preference Optimization (CPO) introduced for provable alignment.
