Semi-DPO: Semi-Supervised Learning for Noisy Preferences in Diffusion DPO
A recent study posted to arXiv (2604.24952) presents Semi-DPO, a semi-supervised learning method for handling label noise in Diffusion Direct Preference Optimization (DPO). Human visual preferences are inherently multi-dimensional, yet current datasets collapse them into a single binary label (winner/loser), which produces conflicting gradient signals during training. Semi-DPO treats consistent preference pairs as clean labeled data and conflicting pairs as noisy unlabeled data. It first trains on a consensus-filtered clean subset, then uses the partially trained model as an implicit classifier to generate pseudo-labels for the noisy data, refining the model iteratively. The authors report state-of-the-art performance.
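The data pipeline described above can be sketched in a few lines. This is a hedged illustration, not the paper's actual code: the vote format, the `implicit_score` callback, and the confidence threshold are all assumptions introduced here for clarity.

```python
# Illustrative sketch of the Semi-DPO data split and pseudo-labeling step.
# All names (vote format, implicit_score, threshold) are assumptions,
# not the paper's API.

def split_by_consensus(pairs):
    """Split preference pairs into a clean (consensus) set and a
    noisy (conflicting) set based on annotator agreement.

    Each pair is a dict with 'votes': a list of +1 (image A preferred)
    or -1 (image B preferred) from independent annotators.
    """
    clean, noisy = [], []
    for p in pairs:
        votes = p["votes"]
        if all(v == votes[0] for v in votes):   # unanimous -> clean label
            clean.append({**p, "label": votes[0]})
        else:                                   # conflicting -> unlabeled
            noisy.append(p)
    return clean, noisy


def pseudo_label(noisy, implicit_score, threshold=0.0):
    """Use the partially trained model as an implicit classifier:
    keep a pseudo-label only when the model's preference margin
    for a pair exceeds a confidence threshold."""
    labeled = []
    for p in noisy:
        margin = implicit_score(p)              # model's reward difference
        if abs(margin) > threshold:
            labeled.append({**p, "label": 1 if margin > 0 else -1})
    return labeled


# Toy usage: two unanimous pairs and one conflicting pair.
pairs = [
    {"id": 0, "votes": [1, 1, 1]},
    {"id": 1, "votes": [-1, -1]},
    {"id": 2, "votes": [1, -1, 1]},
]
clean, noisy = split_by_consensus(pairs)
# Stand-in scorer: pretend the model slightly prefers image A.
relabeled = pseudo_label(noisy, implicit_score=lambda p: 0.3, threshold=0.1)
print(len(clean), len(noisy), relabeled[0]["label"])
```

In the full method this split/label/retrain cycle would repeat, with the growing labeled set feeding the next round of DPO training.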
Key facts
- arXiv paper 2604.24952
- Semi-DPO addresses label noise in Diffusion DPO
- Human visual preferences are multi-dimensional
- Existing datasets use single binary labels
- Conflicting gradient signals misguide DPO
- Semi-DPO uses semi-supervised learning
- Consistent pairs are clean labeled data
- Conflicting pairs are noisy unlabeled data
- Consensus-filtered clean subset for initial training
- Implicit classifier generates pseudo-labels
- Iterative refinement improves performance
- State-of-the-art results reported
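The "implicit classifier" fact above presumably rests on the standard DPO result that the policy itself induces a reward. A sketch of that standard relation (from the original DPO formulation, not the paper's exact diffusion-space version):

```latex
% Implicit reward induced by the DPO objective:
r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}

% Pseudo-label for an unlabeled pair (y_A, y_B): prefer y_A iff
r_\theta(x, y_A) - r_\theta(x, y_B) > 0
```

Here $\pi_\theta$ is the trained policy, $\pi_{\mathrm{ref}}$ the frozen reference model, and $\beta$ the usual DPO temperature; the sign of the reward margin supplies the pseudo-label for conflicting pairs.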
Entities
Institutions
- arXiv