Tie Training Mitigates Spurious Correlations in Preference Optimization
A new theoretical analysis argues that preference learning methods such as Direct Preference Optimization (DPO) inherently induce reliance on spurious correlations, causing sycophancy and length bias in language models. The study, posted to arXiv, identifies two mechanisms behind this reliance: mean spurious bias and causal-spurious correlation leakage. It further shows that adding more data drawn from the same training distribution does not reduce the dependence. The authors propose tie training as a provably effective mitigation strategy.
Key facts
- Preference learning methods like DPO induce reliance on spurious correlations.
- Spurious correlations lead to sycophancy and length bias in language models.
- The study provides a unified theoretical analysis of spurious learning.
- Two channels of spurious feature reliance are identified: mean spurious bias and causal-spurious correlation leakage.
- Adding more data from the same training distribution does not reduce spurious feature dependence.
- Tie training is proposed as a mitigation strategy.
- The analysis focuses on log-linear policies.
- The paper is available on arXiv with ID 2605.11134.
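To make the contrast concrete, the sketch below implements the standard DPO loss alongside a tie-aware loss. The tie term uses a Rao-Kupper-style tie probability, which is an assumption for illustration; the paper's exact tie-training objective may differ. The idea is that a tied pair (e.g., two responses differing only in a spurious feature such as length) is fit by pushing the implicit reward margin toward zero rather than amplifying it.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def dpo_margin(logp_w: float, logp_l: float,
               ref_logp_w: float, ref_logp_l: float,
               beta: float = 0.1) -> float:
    """Implicit reward margin used by DPO:
    beta * [(log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l))]."""
    return beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))

def dpo_loss(margin: float) -> float:
    """Standard DPO objective for a strict preference: -log sigma(margin).
    Minimized by driving the margin arbitrarily large."""
    return -math.log(sigmoid(margin))

def tie_loss(margin: float, tau: float = 1.0) -> float:
    """Hypothetical tie-training term (Rao-Kupper-style, tau > 0):
    P(tie) = 1 - sigma(margin - tau) - sigma(-margin - tau).
    Minimized at margin = 0, discouraging reward separation on tied pairs."""
    p_tie = 1.0 - sigmoid(margin - tau) - sigmoid(-margin - tau)
    return -math.log(p_tie)
```

Under this sketch, `dpo_loss` always rewards a larger margin, even when the two responses differ only in a spurious feature, whereas `tie_loss` penalizes any nonzero margin on pairs labeled as ties.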