SelectiveRM: Optimal Transport for LLM Reward Modeling from Noisy Preference
Researchers have introduced SelectiveRM, a framework that uses optimal transport to improve reward modeling in Reinforcement Learning from Human Feedback (RLHF). Conventional training objectives tend to overfit the label noise present in real-world preference datasets, and existing denoising techniques typically assume that noise is homogeneous, overlooking the heterogeneous structure of linguistic preferences. SelectiveRM introduces a Joint Consistency Discrepancy that aligns model predictions with the distribution of the preference data, together with a Mass Relaxation mechanism based on partial transport that excludes noisy preference samples whose labels conflict with semantic consistency. Theoretically, the authors argue that SelectiveRM optimizes a tighter upper bound on the true, unobserved reward. The work is documented in arXiv:2605.06036v1.
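For context, the conventional objective in question is usually the Bradley-Terry pairwise log-likelihood, which treats every preference label as correct. A minimal PyTorch sketch (the `RewardModel` architecture and all names here are illustrative, not from the paper) shows why a mislabeled pair is fit just as confidently as a clean one:

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy scalar reward head over precomputed response embeddings.

    Hypothetical stand-in for an LLM-based reward model; the paper's
    actual architecture is not described in this summary.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.head = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(x).squeeze(-1)

def bt_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry negative log-likelihood: -log sigmoid(r_w - r_l).

    Every preference label is taken at face value, which is exactly
    how noisy pairs end up overfit.
    """
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# usage with random stand-in embeddings
model = RewardModel(dim=768)
emb_chosen = torch.randn(32, 768)    # embeddings of preferred responses
emb_rejected = torch.randn(32, 768)  # embeddings of rejected responses
loss = bt_loss(model(emb_chosen), model(emb_rejected))
loss.backward()
```

Because every pair contributes equally to this loss, a flipped label pushes the model in exactly the wrong direction; SelectiveRM's selection mechanism targets this failure mode.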
Key facts
- SelectiveRM is a framework grounded in optimal transport for reward modeling.
- It addresses noisy preferences in RLHF datasets.
- Joint Consistency Discrepancy aligns model predictions with preference data.
- Mass Relaxation mechanism uses partial transport to exclude noisy samples (see the sketch after this list).
- The method optimizes a tighter upper bound on the true reward.
- Published as arXiv:2605.06036v1.
- Conventional training objectives overfit noise in preference data.
- Existing denoising approaches assume homogeneous noise.
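The summary names the Mass Relaxation mechanism but gives none of its equations, so the sketch below is one plausible reading rather than the paper's specification: cap the total transported mass at m < 1 using the standard dummy-bin reduction of partial optimal transport to a balanced Sinkhorn problem, then treat each preference pair's retained mass as a per-sample weight, so high-cost (semantically inconsistent) pairs are effectively excluded. The cost matrix, the value of m, and all function names are assumptions.

```python
import numpy as np

def partial_ot_plan(cost: np.ndarray, m: float, reg: float = 0.05,
                    n_iter: int = 1000) -> np.ndarray:
    """Entropic partial OT between two uniform marginals.

    Dummy-bin construction: each side gets a virtual bin holding the
    untransported mass 1 - m; real<->dummy moves are free, and the
    dummy<->dummy move is made prohibitively expensive, so exactly
    mass m flows between real points. Returns the n x k real plan.
    """
    n, k = cost.shape
    a = np.concatenate([np.full(n, 1.0 / n), [1.0 - m]])  # source + dummy
    b = np.concatenate([np.full(k, 1.0 / k), [1.0 - m]])  # target + dummy
    C = np.zeros((n + 1, k + 1))
    C[:n, :k] = cost
    C[n, k] = 1e2 * cost.max()        # forbid dummy-to-dummy transport
    K = np.exp(-C / reg)
    u, v = np.ones(n + 1), np.ones(k + 1)
    for _ in range(n_iter):           # standard Sinkhorn scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    plan = u[:, None] * K * v[None, :]
    return plan[:n, :k]

# usage: weight each preference pair by how much mass it retains
rng = np.random.default_rng(0)
feats_pred = rng.normal(size=(8, 4))  # e.g., model-side representations
feats_data = rng.normal(size=(8, 4))  # e.g., preference-data representations
cost = np.linalg.norm(feats_pred[:, None] - feats_data[None, :], axis=-1)
plan = partial_ot_plan(cost / cost.max(), m=0.8)
keep_weight = plan.sum(axis=1) * 8    # ~1 = kept, ~0 = excluded as noisy
```

With entropic regularization the exclusion is soft: suspect pairs are down-weighted rather than hard-dropped, which is typically easier to train with than a binary filter.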