Unified Pair-GRPO Framework for Stable LLM Alignment
A study published on arXiv (2605.06375) presents the Pair-GRPO family, a unified theoretical framework for optimizing large language models (LLMs) with preference-based reinforcement learning. The family comprises two variants: Soft-Pair-GRPO and Hard-Pair-GRPO. Soft-Pair-GRPO makes a minimal change to Group Relative Policy Optimization (GRPO), substituting binary pairwise preference rewards for group-normalized scalar rewards while preserving GRPO's clipped surrogate objective and KL regularization. The authors prove a gradient equivalence theorem showing that, under a first-order Taylor expansion, the gradient of Soft-Pair-GRPO is a positive scalar multiple of the standard GRPO gradient. The framework targets known difficulties in RLHF, including unstable policy updates, ambiguous gradient directions, and high gradient variance.
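The paper's implementation is not reproduced here, but the described change can be illustrated with a short sketch. The snippet below contrasts GRPO-style group-normalized advantages with a binary pairwise win/loss signal, reusing the same clipped surrogate and KL penalty; the function names, the win-rate interpretation of "binary pairwise preference rewards", and the hyperparameter values (`eps_clip`, `beta`) are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch (not the authors' code): contrasts GRPO's
# group-normalized scalar advantages with a binary pairwise preference
# signal, under the same clipped surrogate with a KL penalty.
import torch


def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Standard GRPO: normalize scalar rewards within each group.

    rewards: (num_groups, group_size) scalar rewards per sampled response.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)


def pairwise_preference_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """One plausible reading of Soft-Pair-GRPO's reward substitution:
    each response gets a binary win/loss outcome against every other
    response in its group, averaged into a win rate and centered so the
    signal sums to zero within the group.
    """
    # wins[g, i, j] = 1 if response i beats response j in group g
    wins = (rewards.unsqueeze(-1) > rewards.unsqueeze(-2)).float()
    win_rate = wins.mean(dim=-1)  # fraction of pairwise wins per response
    return win_rate - win_rate.mean(dim=-1, keepdim=True)


def clipped_kl_loss(
    logp_new: torch.Tensor,   # log-probs of responses under the current policy
    logp_old: torch.Tensor,   # log-probs under the sampling (old) policy
    logp_ref: torch.Tensor,   # log-probs under the frozen reference policy
    advantages: torch.Tensor,
    eps_clip: float = 0.2,
    beta: float = 0.04,
) -> torch.Tensor:
    """Clipped surrogate objective with a KL penalty, as retained from GRPO."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)
    # k3-style KL estimate toward the reference policy, a common GRPO choice
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1
    return -(surrogate - beta * kl).mean()
```

Under this reading, Soft-Pair-GRPO would differ from GRPO only in which `advantages` tensor is passed to `clipped_kl_loss`, consistent with the paper's claim that the clipped surrogate and KL-regularized structure are preserved.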
Key facts
- The Pair-GRPO family includes Soft-Pair-GRPO and Hard-Pair-GRPO.
- Soft-Pair-GRPO replaces group-normalized scalar rewards with binary pairwise preference rewards.
- It retains GRPO's clipped surrogate and KL-regularized structure.
- A gradient equivalence theorem is proved for Soft-Pair-GRPO; a schematic form is given after this list.
- The framework addresses unstable policy updates in RLHF.
- It addresses ambiguous gradient directions and high gradient variance.
- The paper is published on arXiv with ID 2605.06375.
- The approach is a unified theoretical framework for preference-based RL optimization.
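The gradient equivalence claim can be written schematically as follows; the objective symbols and the constant c are notational assumptions, since the paper's exact statement is not reproduced here.

```latex
% Schematic form of the gradient equivalence claim (notation assumed):
% under a first-order Taylor expansion, the Soft-Pair-GRPO gradient is a
% positive scalar multiple of the standard GRPO gradient.
\[
  \nabla_\theta \, J_{\text{Soft-Pair-GRPO}}(\theta)
  \;\approx\; c \, \nabla_\theta \, J_{\text{GRPO}}(\theta),
  \qquad c > 0.
\]
```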
Entities
Institutions
- arXiv