Tournament-GRPO: Group-Wise Tournament Rewards for RL in Long-Form Generation
Tournament-GRPO is a reinforcement learning framework for open-ended long-form generation in settings where reference answers and automated metrics are unavailable. Existing rubric-based methods rely on pointwise LLM-as-a-judge scoring, which can be difficult to calibrate and prone to saturation. Tournament-GRPO instead converts rubric-guided LLM judgments into relative rewards through repeated multi-round tournaments among rollouts sampled for the same query. It compares candidates within each group, aggregates the tournament outcomes, and normalizes them into group-wise rewards for GRPO training. Experiments on Deep Research Bench show that Tournament-GRPO consistently outperforms existing reward-design baselines, improving the overall score by 4.52 points over the strongest baseline. The paper is available on arXiv under ID 2605.26958.
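The pipeline described above (compare same-query rollouts in tournaments, aggregate the outcomes, normalize within the group) can be illustrated with a minimal sketch. This is not the paper's implementation: the summary does not specify the tournament format, the aggregation rule, or the normalization, so the sketch assumes single-elimination brackets with random seeding, total win counts as the aggregate, and a zero-mean, unit-variance normalization within the group. The names `tournament_rewards` and `length_judge` are hypothetical, and `length_judge` is only a stand-in for a rubric-guided LLM judge.

```python
import random
import statistics
from typing import Callable, List, Sequence


def length_judge(first: str, second: str) -> int:
    """Hypothetical stand-in for a rubric-guided LLM judge.

    Returns 0 if `first` wins and 1 if `second` wins. Here the longer text
    wins, purely so the example runs without a model call.
    """
    return 0 if len(first) >= len(second) else 1


def tournament_rewards(
    rollouts: Sequence[str],
    judge: Callable[[str, str], int],
    num_tournaments: int = 4,
    seed: int = 0,
) -> List[float]:
    """Return one group-normalized reward per rollout for a single query.

    Assumptions (not taken from the paper): single-elimination brackets,
    random seeding per tournament, rewards based on total match wins.
    """
    rng = random.Random(seed)
    n = len(rollouts)
    wins = [0] * n

    for _ in range(num_tournaments):
        alive = list(range(n))
        rng.shuffle(alive)  # fresh random bracket for each tournament
        while len(alive) > 1:
            next_round = []
            if len(alive) % 2 == 1:
                # Odd number of candidates: one advances without a match.
                next_round.append(alive.pop())
            for a, b in zip(alive[0::2], alive[1::2]):
                winner = a if judge(rollouts[a], rollouts[b]) == 0 else b
                wins[winner] += 1
                next_round.append(winner)
            alive = next_round

    # Normalize win counts within the group (zero mean, unit variance).
    mean = statistics.fmean(wins)
    std = statistics.pstdev(wins)
    if std == 0:
        return [0.0] * n  # no candidate was distinguished from the others
    return [(w - mean) / std for w in wins]
```

Calling `tournament_rewards(["a", "bbb", "cc", "dddd"], length_judge)` returns four rewards that sum to approximately zero, with the largest reward assigned to the rollout that the stand-in judge always prefers. In a real setup the judge call would be a pairwise LLM comparison, and the resulting values would be used as the per-rollout rewards for the GRPO update.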
Key facts
- Tournament-GRPO is a group-wise reward framework for reinforcement learning in open-ended long-form generation.
- It uses repeated multi-round tournaments among same-query rollouts to convert rubric-guided LLM judgments into relative rewards.
- The method normalizes tournament outcomes into group-wise rewards for GRPO training.
- Experiments on Deep Research Bench show a 4.52-point overall-score improvement over the strongest baseline.
- The paper is available on arXiv under ID 2605.26958.
- Existing rubric-based methods rely on pointwise LLM-as-a-judge scoring, which can be difficult to calibrate and prone to saturation.
- Tournament-GRPO provides stronger discrimination among same-query rollouts.
- The framework consistently outperforms existing reward-design baselines.
Entities
Institutions
- arXiv