DynamicPO Prevents Preference Optimization Collapse in LLM Recommendations
A new study posted to arXiv (2605.00327) identifies a phenomenon called preference optimization collapse in large language model (LLM)-based recommendation systems trained with direct preference optimization (DPO). The researchers found that increasing the number of negative samples can degrade recommendation performance even as training loss keeps decreasing: gradient contributions from easily discriminable negatives overwhelm those from boundary-critical ones, suppressing the signal needed to maintain the decision boundary. To address this, they propose Dynamic Preference Optimization (DynamicPO), a lightweight method that dynamically weights negatives so that boundary-critical samples retain influence. The work supports its account of the collapse mechanism with both empirical and theoretical analysis.
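The mechanics are easiest to see in a small sketch. The code below is illustrative only, not the authors' implementation: multi_negative_dpo_loss assumes one common multi-negative extension of DPO that averages the pairwise log-sigmoid term over K negatives, and dynamic_po_loss is a hypothetical DynamicPO-style variant that renormalizes each negative's gradient weight so boundary-critical (small-margin) negatives are not drowned out.

```python
import torch
import torch.nn.functional as F

def multi_negative_dpo_loss(logp_pos, logp_negs, ref_logp_pos, ref_logp_negs, beta=1.0):
    """Plain multi-negative DPO (assumed form): average the pairwise DPO term
    over all negatives. logp_pos: (B,), logp_negs: (B, K); ref_* are the
    frozen reference model's log-probs with the same shapes."""
    # Implicit reward margin per negative:
    # beta * [(logp_pos - ref_pos) - (logp_neg_k - ref_neg_k)]
    margins = beta * ((logp_pos - ref_logp_pos).unsqueeze(1)
                      - (logp_negs - ref_logp_negs))          # (B, K)
    return -F.logsigmoid(margins).mean()                      # average over B and K

def dynamic_po_loss(logp_pos, logp_negs, ref_logp_pos, ref_logp_negs, beta=1.0):
    """Hypothetical DynamicPO-style variant (an assumption, not the paper's
    published objective): reweight each negative so that hard, boundary-critical
    negatives keep a fixed share of the gradient however many easy negatives
    are added."""
    margins = beta * ((logp_pos - ref_logp_pos).unsqueeze(1)
                      - (logp_negs - ref_logp_negs))          # (B, K)
    # sigma(-margin) is the per-negative gradient coefficient in vanilla DPO;
    # detach so the weights steer, but do not themselves receive, gradient.
    w = torch.sigmoid(-margins).detach()
    w = w / w.sum(dim=1, keepdim=True).clamp_min(1e-8)        # normalize over K
    return -(w * F.logsigmoid(margins)).sum(dim=1).mean()
```

The choice of σ(-margin) as the weight is not arbitrary: in vanilla DPO it is exactly the coefficient multiplying each pair's gradient, so normalizing it across the K negatives keeps the update share devoted to hard negatives roughly constant as K grows. Whether the published method uses this particular weighting is an assumption here.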
Key facts
- Preference optimization collapse occurs when increasing negative samples degrades recommendation performance.
- Training loss continuously decreases even as performance drops.
- The collapse is driven by gradient suppression: contributions from easily discriminable negatives dominate the aggregate update (a toy calculation after this list makes the effect concrete).
- Boundary-critical negatives are consequently under-optimized, weakening the decision boundary.
- DynamicPO is proposed as a lightweight solution to dynamically weight negatives.
- The study supports its analysis with both empirical experiments and a theoretical demonstration of the collapse mechanism.
- The paper is published on arXiv with identifier 2605.00327.
- The work focuses on LLM-based recommendation systems using DPO.
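The first three facts can be reproduced with a toy calculation (an illustration under the averaged-objective assumption sketched earlier, not a result from the paper): as easily discriminable negatives are added next to a single boundary-critical one, the mean loss keeps falling while the hard negative's share of the gradient shrinks toward zero.

```python
import torch
import torch.nn.functional as F

beta = 1.0
hard_margin = 0.1   # boundary-critical negative: barely separated from the positive
easy_margin = 4.0   # easily discriminable negative: large implicit-reward gap

for k_easy in (1, 7, 31, 127):
    margins = beta * torch.tensor([hard_margin] + [easy_margin] * k_easy)
    k = margins.numel()
    # Per-negative gradient coefficient in the averaged objective: sigma(-m) / K.
    hard_grad_weight = (torch.sigmoid(-margins[0]) / k).item()
    loss = (-F.logsigmoid(margins).mean()).item()
    print(f"K={k:4d}  mean loss={loss:.4f}  hard-negative grad weight={hard_grad_weight:.5f}")
```

The mean loss falls monotonically while the boundary-critical negative's effective learning signal collapses, which is exactly the dissociation between loss and performance the authors report.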