Semi-DPO: Semi-Supervised Learning for Noisy Preferences in Diffusion DPO
A recent study posted to arXiv (2604.24952) presents Semi-DPO, a semi-supervised learning method for handling label noise in Diffusion Direct Preference Optimization (DPO). Human visual preferences are inherently multi-dimensional, yet current datasets collapse them into a single binary label (winner/loser), which produces conflicting gradient signals during training. Semi-DPO treats consistent preference pairs as clean labeled data and conflicting pairs as noisy unlabeled data. It first trains on a consensus-filtered clean subset, then uses the partially trained model as an implicit classifier to generate pseudo-labels for the noisy data, refining the model iteratively. The authors report state-of-the-art performance.
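The data pipeline described above can be sketched in a few lines. This is a hedged illustration, not the paper's actual code: the vote format, the `implicit_score` callback, and the confidence threshold are all assumptions introduced here for clarity.

```python
# Illustrative sketch of the Semi-DPO data split and pseudo-labeling step.
# All names (vote format, implicit_score, threshold) are assumptions,
# not the paper's API.

def split_by_consensus(pairs):
    """Split preference pairs into a clean (consensus) set and a
    noisy (conflicting) set based on annotator agreement.

    Each pair is a dict with 'votes': a list of +1 (image A preferred)
    or -1 (image B preferred) from independent annotators.
    """
    clean, noisy = [], []
    for p in pairs:
        votes = p["votes"]
        if all(v == votes[0] for v in votes):   # unanimous -> clean label
            clean.append({**p, "label": votes[0]})
        else:                                   # conflicting -> unlabeled
            noisy.append(p)
    return clean, noisy


def pseudo_label(noisy, implicit_score, threshold=0.0):
    """Use the partially trained model as an implicit classifier:
    keep a pseudo-label only when the model's preference margin
    for a pair exceeds a confidence threshold."""
    labeled = []
    for p in noisy:
        margin = implicit_score(p)              # model's reward difference
        if abs(margin) > threshold:
            labeled.append({**p, "label": 1 if margin > 0 else -1})
    return labeled


# Toy usage: two unanimous pairs and one conflicting pair.
pairs = [
    {"id": 0, "votes": [1, 1, 1]},
    {"id": 1, "votes": [-1, -1]},
    {"id": 2, "votes": [1, -1, 1]},
]
clean, noisy = split_by_consensus(pairs)
# Stand-in scorer: pretend the model slightly prefers image A.
relabeled = pseudo_label(noisy, implicit_score=lambda p: 0.3, threshold=0.1)
print(len(clean), len(noisy), relabeled[0]["label"])
```

In the full method this split/label/retrain cycle would repeat, with the growing labeled set feeding the next round of DPO training.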
Key facts
- arXiv paper 2604.24952
- Semi-DPO addresses label noise in Diffusion DPO
- Human visual preferences are multi-dimensional
- Existing datasets use single binary labels
- Conflicting gradient signals misguide DPO
- Semi-DPO uses semi-supervised learning
- Consistent pairs are clean labeled data
- Conflicting pairs are noisy unlabeled data
- Consensus-filtered clean subset for initial training
- Implicit classifier generates pseudo-labels
- Iterative refinement improves performance
- State-of-the-art results reported
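The "implicit classifier" fact above presumably rests on the standard DPO result that the policy itself induces a reward. A sketch of that standard relation (from the original DPO formulation, not the paper's exact diffusion-space version):

```latex
% Implicit reward induced by the DPO objective:
r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}

% Pseudo-label for an unlabeled pair (y_A, y_B): prefer y_A iff
r_\theta(x, y_A) - r_\theta(x, y_B) > 0
```

Here $\pi_\theta$ is the trained policy, $\pi_{\mathrm{ref}}$ the frozen reference model, and $\beta$ the usual DPO temperature; the sign of the reward margin supplies the pseudo-label for conflicting pairs.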
Entities
Institutions
- arXiv