PPR-GDE: New RL Method for Open-Ended Generation Without Scalar Rewards
A new reinforcement learning method called Pairwise Preference Reward and Group-based Diversity Enhancement (PPR-GDE) has been proposed for open-ended generation tasks. Unlike traditional RL methods that rely on scalar rewards, PPR-GDE uses pairwise preference rewards to capture subjective evaluation and incorporates group-level diversity into the reward signal to prevent diversity collapse. The method also mitigates judge position bias by repeating each comparison with the response order swapped. The approach targets two difficulties of open-domain scenarios: correctness is hard to verify, and training reward models is costly in both computation and annotation.
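The swapped-order comparison can be illustrated with a minimal sketch. This is not the authors' implementation: the `judge` callable, its signature, and the win-rate aggregation are assumptions made for illustration only. The judge is assumed to return a value in [0, 1] giving how strongly it prefers the response shown first.

```python
from itertools import combinations
from typing import Callable, List

# Hypothetical judge: judge(prompt, first, second) -> preference for `first` in [0, 1].
Judge = Callable[[str, str, str], float]


def pairwise_rewards(prompt: str, responses: List[str], judge: Judge) -> List[float]:
    """Average win rate of each response against the rest of its group.

    Every pair is judged twice, once in each presentation order, and the two
    verdicts are averaged so that a bias toward either position cancels out.
    """
    n = len(responses)
    if n < 2:
        raise ValueError("need at least two responses to compare")
    wins = [0.0] * n
    for i, j in combinations(range(n), 2):
        first_shown_i = judge(prompt, responses[i], responses[j])  # i in first slot
        first_shown_j = judge(prompt, responses[j], responses[i])  # j in first slot
        # Preference for i, averaged over both orders.
        pref_i = 0.5 * (first_shown_i + (1.0 - first_shown_j))
        wins[i] += pref_i
        wins[j] += 1.0 - pref_i
    return [w / (n - 1) for w in wins]
```

With a judge that always favors whichever response appears first, every response ends up with the same reward of 0.5, which is the intended effect of swapping the order.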
Key facts
- PPR-GDE is a reinforcement learning method for open-ended generation.
- It does not require scalar rewards.
- It uses pairwise preference rewards for subjective evaluation.
- It incorporates group-level diversity into the reward signal.
- It mitigates judge position bias via repeated comparisons with swapped response order.
- Traditional RL methods often lead to diversity collapse in open-ended tasks.
- Verifying correctness in open-ended generation is challenging.
- Training reward models incurs substantial computational and annotation costs.
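As a rough illustration of how group-level diversity might enter the reward signal, the sketch below adds a diversity bonus to a per-response preference score. The summary does not say how PPR-GDE measures diversity or combines the terms, so the token-set Jaccard distance, the additive form, and the `weight` parameter are all assumptions.

```python
from typing import List


def jaccard_distance(a: str, b: str) -> float:
    """1 - |A ∩ B| / |A ∪ B| over lowercase whitespace tokens (assumed metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def diversity_scores(responses: List[str]) -> List[float]:
    """Mean distance from each response to the other members of its group."""
    n = len(responses)
    if n < 2:
        return [0.0] * n
    return [
        sum(jaccard_distance(responses[i], responses[j]) for j in range(n) if j != i)
        / (n - 1)
        for i in range(n)
    ]


def combined_rewards(
    preference: List[float], responses: List[str], weight: float = 0.1
) -> List[float]:
    """Preference reward plus a weighted diversity bonus (assumed additive form)."""
    diversity = diversity_scores(responses)
    return [p + weight * d for p, d in zip(preference, diversity)]
```

Under this scheme, two near-identical responses in a group receive a smaller bonus than a response that differs from the rest, which gives the policy a reason not to collapse onto a single output.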