CroCo: Cross-Lingual Contrastive Preference Tuning on Self-Generations
CroCo is a new method that extends contrastive preference tuning to multiple languages without requiring language-specific preference annotations. Using a reward model trained on English preferences atop a multilingual base, CroCo obtains useful within-language rankings across 14 high- and low-resource languages. The approach improves performance in most setups while preventing catastrophic forgetting of what was learned during supervised fine-tuning. The gains depend on on-policy data: off-policy responses reduce the benefit, and online preference optimization fails.
Key facts
- CroCo extends contrastive preference tuning to multiple languages.
- No language-specific preference annotation is needed.
- The reward model is trained on English preferences atop a multilingual base.
- Evaluated on 14 high- and low-resource languages.
- Improves performance in the majority of setups.
- Prevents catastrophic forgetting of supervised fine-tuning.
- Gains require on-policy data.
- Off-policy responses reduce benefit; online preference optimization fails.
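
The data-construction step implied by these facts can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `build_preference_pairs`, the `Samples` layout, the `min_margin` filter, and the best-versus-worst pairing rule are all hypothetical, and `toy_reward` is only a placeholder for the English-trained reward model. The one property taken from the summary is that responses are the model's own generations and are ranked only against other responses in the same language.

```python
from typing import Callable, Dict, List

# Hypothetical data layout: language code -> prompt -> responses sampled
# from the model being tuned (on-policy self-generations).
Samples = Dict[str, Dict[str, List[str]]]


def build_preference_pairs(
    samples: Samples,
    reward_fn: Callable[[str, str], float],
    min_margin: float = 0.0,
) -> List[dict]:
    """Rank each prompt's responses and keep the best and worst as one pair."""
    pairs = []
    for lang, prompts in samples.items():
        for prompt, responses in prompts.items():
            if len(responses) < 2:
                continue  # a single response cannot be ranked
            # Scores are compared only within one prompt, hence within one language.
            scored = sorted((reward_fn(prompt, r), r) for r in responses)
            (low_score, rejected), (high_score, chosen) = scored[0], scored[-1]
            if high_score - low_score > min_margin:
                pairs.append(
                    {"lang": lang, "prompt": prompt,
                     "chosen": chosen, "rejected": rejected}
                )
    return pairs


def toy_reward(prompt: str, response: str) -> float:
    # Placeholder only: stands in for the English-trained reward model.
    return float(len(response.split()))


samples = {
    "de": {"Was ist Python?": ["Eine Sprache.",
                               "Python ist eine Programmiersprache."]},
    "sw": {"Python ni nini?": ["Lugha.", "Python ni lugha ya programu."]},
}
pairs = build_preference_pairs(samples, toy_reward)
```

Each resulting record carries a `chosen` and a `rejected` response in the same language, which is the input shape a pairwise preference-tuning objective typically expects. Prompts whose responses all receive the same score are dropped, since they give no ranking signal.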