Tournament-GRPO: Group-Wise Tournament Rewards for RL in Long-Form Generation

ai-technology · 2026-05-27

Tournament-GRPO is a reinforcement learning framework for open-ended long-form generation, where reference answers and automated metrics are unavailable. Existing rubric-based methods rely on pointwise LLM-as-a-judge scoring, which can be difficult to calibrate and can saturate. Tournament-GRPO instead converts rubric-guided LLM judgments into relative rewards through repeated multi-round tournaments among rollouts for the same query. It compares candidates within a group, aggregates the tournament outcomes, and normalizes them into group-wise rewards for GRPO training. In experiments on Deep Research Bench, Tournament-GRPO consistently outperforms existing reward-design baselines, improving the overall score by 4.52 points over the strongest baseline. The paper is available on arXiv under ID 2605.26958.
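The pipeline described above (compare same-query rollouts, aggregate tournament outcomes, normalize within the group) can be illustrated with a rough Python sketch. This is not the paper's implementation: the summary does not specify the tournament format, so the random pairing per round, the win-rate aggregation, and the mean/std normalization are all assumptions, and `tournament_rewards`, `judge`, `num_tournaments`, and `num_rounds` are hypothetical names. The `judge` callable stands in for a rubric-guided LLM comparison and returns 0 if the first text wins, 1 otherwise.

```python
import random
import statistics
from typing import Callable, List, Sequence


def tournament_rewards(
    rollouts: Sequence[str],
    judge: Callable[[str, str], int],  # hypothetical pairwise judge: 0 -> first wins, 1 -> second wins
    num_tournaments: int = 4,          # assumed number of repeated tournaments
    num_rounds: int = 3,               # assumed number of rounds per tournament
    seed: int = 0,
) -> List[float]:
    """Return one group-normalized reward per rollout of a single query."""
    rng = random.Random(seed)
    n = len(rollouts)
    wins = [0.0] * n
    games = [0] * n

    for _ in range(num_tournaments):
        for _ in range(num_rounds):
            # Assumed pairing rule: shuffle, then pair neighbours.
            # With an odd group size, the last candidate sits the round out.
            order = list(range(n))
            rng.shuffle(order)
            for i, j in zip(order[0::2], order[1::2]):
                winner = i if judge(rollouts[i], rollouts[j]) == 0 else j
                wins[winner] += 1.0
                games[i] += 1
                games[j] += 1

    # Aggregate outcomes as win rates; a rollout that never played gets a neutral 0.5.
    rates = [w / g if g else 0.5 for w, g in zip(wins, games)]

    # Assumed normalization: standardize within the group (zero mean, unit std).
    mean = statistics.fmean(rates)
    std = statistics.pstdev(rates)
    if std == 0.0:
        return [0.0] * n  # no preference signal in this group
    return [(r - mean) / std for r in rates]


if __name__ == "__main__":
    # Toy judge for demonstration only: prefers the longer text.
    toy_judge = lambda a, b: 0 if len(a) >= len(b) else 1
    print(tournament_rewards(["a", "bbb", "cc", "dddd"], toy_judge))
```

Because the rewards are computed relative to the other rollouts for the same query, they stay informative even when an absolute pointwise score would rate every candidate similarly. In a real setup the toy judge would be replaced by a rubric-guided LLM call.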

Key facts

Tournament-GRPO is a group-wise reward framework for reinforcement learning in open-ended long-form generation.
It uses repeated multi-round tournaments among same-query rollouts to convert rubric-guided LLM judgments into relative rewards.
The method normalizes tournament outcomes into group-wise rewards for GRPO training.
Experiments on Deep Research Bench show a 4.52-point overall-score improvement over the strongest baseline.
The paper is published on arXiv with ID 2605.26958.
Existing rubric-based methods rely on pointwise LLM-as-a-judge scoring, which can be difficult to calibrate and can saturate.
Tournament-GRPO provides stronger discrimination among same-query rollouts.
The framework consistently outperforms existing reward-design baselines.
