Unified Pair-GRPO Framework for Stable LLM Alignment
A study published on arXiv (2605.06375) presents the Pair-GRPO family, a unified theoretical framework for optimizing large language models (LLMs) with preference-based reinforcement learning. The family comprises two variants: Soft-Pair-GRPO and Hard-Pair-GRPO. Soft-Pair-GRPO makes a minimal change to Group Relative Policy Optimization (GRPO), substituting binary pairwise preference rewards for group-normalized scalar rewards while preserving GRPO's clipped surrogate objective and KL regularization. The authors prove a gradient equivalence theorem showing that, under a first-order Taylor expansion, the gradient of Soft-Pair-GRPO is a positive scalar multiple of the standard GRPO gradient. The framework targets known difficulties in RLHF, including unstable policy updates, ambiguous gradient directions, and high gradient variance.
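The paper's implementation is not reproduced here, but the described change can be illustrated with a short sketch. The snippet below contrasts GRPO-style group-normalized advantages with a binary pairwise win/loss signal, reusing the same clipped surrogate and KL penalty; the function names, the win-rate interpretation of "binary pairwise preference rewards", and the hyperparameter values (`eps_clip`, `beta`) are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch (not the authors' code): contrasts GRPO's
# group-normalized scalar advantages with a binary pairwise preference
# signal, under the same clipped surrogate with a KL penalty.
import torch


def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Standard GRPO: normalize scalar rewards within each group.

    rewards: (num_groups, group_size) scalar rewards per sampled response.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)


def pairwise_preference_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """One plausible reading of Soft-Pair-GRPO's reward substitution:
    each response gets a binary win/loss outcome against every other
    response in its group, averaged into a win rate and centered so the
    signal sums to zero within the group.
    """
    # wins[g, i, j] = 1 if response i beats response j in group g
    wins = (rewards.unsqueeze(-1) > rewards.unsqueeze(-2)).float()
    win_rate = wins.mean(dim=-1)  # fraction of pairwise wins per response
    return win_rate - win_rate.mean(dim=-1, keepdim=True)


def clipped_kl_loss(
    logp_new: torch.Tensor,   # log-probs of responses under the current policy
    logp_old: torch.Tensor,   # log-probs under the sampling (old) policy
    logp_ref: torch.Tensor,   # log-probs under the frozen reference policy
    advantages: torch.Tensor,
    eps_clip: float = 0.2,
    beta: float = 0.04,
) -> torch.Tensor:
    """Clipped surrogate objective with a KL penalty, as retained from GRPO."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)
    # k3-style KL estimate toward the reference policy, a common GRPO choice
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1
    return -(surrogate - beta * kl).mean()
```

Under this reading, Soft-Pair-GRPO would differ from GRPO only in which `advantages` tensor is passed to `clipped_kl_loss`, consistent with the paper's claim that the clipped surrogate and KL-regularized structure are preserved.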
Key facts
- The Pair-GRPO family includes Soft-Pair-GRPO and Hard-Pair-GRPO.
- Soft-Pair-GRPO replaces group-normalized scalar rewards with binary pairwise preference rewards.
- It retains GRPO's clipped surrogate and KL-regularized structure.
- A gradient equivalence theorem is proved for Soft-Pair-GRPO; a schematic form is given after this list.
- The framework addresses unstable policy updates in RLHF.
- It addresses ambiguous gradient directions and high gradient variance.
- The paper is published on arXiv with ID 2605.06375.
- The approach is a unified theoretical framework for preference-based RL optimization.
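The gradient equivalence claim can be written schematically as follows; the objective symbols and the constant c are notational assumptions, since the paper's exact statement is not reproduced here.

```latex
% Schematic form of the gradient equivalence claim (notation assumed):
% under a first-order Taylor expansion, the Soft-Pair-GRPO gradient is a
% positive scalar multiple of the standard GRPO gradient.
\[
  \nabla_\theta \, J_{\text{Soft-Pair-GRPO}}(\theta)
  \;\approx\; c \, \nabla_\theta \, J_{\text{GRPO}}(\theta),
  \qquad c > 0.
\]
```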
Entities
Institutions
- arXiv