Balanced Aggregation Fixes Bias in GRPO Reinforcement Learning
A new arXiv preprint identifies and corrects an aggregation bias in Group Relative Policy Optimization (GRPO), a widely used reinforcement learning method for large language models. The authors show that the two common ways of aggregating per-token losses each introduce a distinct bias: token-level aggregation couples a response's length with the sign of its advantage (sign-length coupling), while the standard sequence-level aggregation implicitly downweights tokens in longer responses. They propose Balanced Aggregation (BA), a drop-in replacement that computes token-level means separately within the positive- and negative-advantage subsets of a group and combines them with weights based on sequence counts. The method aims to improve training for reasoning and code-generation tasks.
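To make the three aggregation schemes concrete, here is a minimal NumPy sketch. The `balanced_agg` function follows the summary above (per-subset token means combined by sequence counts); the paper's exact formulation is not reproduced here, so treat it as an assumption rather than the authors' implementation.

```python
import numpy as np

def sequence_agg(token_losses):
    # Standard GRPO aggregation: mean over tokens within each sequence,
    # then mean over the G sequences. A token in a response of length
    # |o_i| gets weight 1/(G * |o_i|), so tokens in longer responses
    # count less.
    return float(np.mean([np.mean(l) for l in token_losses]))

def token_agg(token_losses):
    # Token-level aggregation: one flat mean over all tokens in the group.
    # Longer responses contribute more tokens, so a response's length is
    # coupled to how strongly its advantage sign shapes the update.
    return float(np.mean(np.concatenate(token_losses)))

def balanced_agg(token_losses, advantages):
    # Sketch of Balanced Aggregation (BA) as described in the summary:
    # token-level means computed separately within the positive- and
    # negative-advantage subsets, combined with weights proportional to
    # the number of sequences in each subset (an assumption; the paper
    # may weight differently).
    g = len(token_losses)
    total = 0.0
    for keep in (lambda a: a > 0, lambda a: a <= 0):
        subset = [l for l, a in zip(token_losses, advantages) if keep(a)]
        if subset:
            total += (len(subset) / g) * float(np.mean(np.concatenate(subset)))
    return total

# Toy group: per-token loss contributions for three sampled responses
# (illustrative numbers only, not from the paper).
losses = [np.full(10, -1.0), np.full(12, -1.0), np.full(100, 2.0)]
advs = [1.0, 1.0, -2.0]
print(sequence_agg(losses), token_agg(losses), balanced_agg(losses, advs))
```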
Key facts
- arXiv paper 2605.04077v1
- RLVR (reinforcement learning with verifiable rewards) is central to reasoning and code generation in LLMs
- GRPO-style training is widely adopted
- Sequence aggregation is standard in GRPO
- Token aggregation has been advocated as a better alternative
- Token aggregation introduces sign-length coupling
- Sequence aggregation implicitly downweights longer responses (both biases are illustrated numerically after this list)
- Balanced Aggregation (BA) is proposed as a drop-in replacement
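The two biases noted above can be seen with a toy per-token weight calculation; the lengths and advantages below are illustrative values, not figures from the paper:

```python
import numpy as np

# Illustrative group: two short positive-advantage responses and one long
# negative-advantage response (values made up for illustration).
lengths = np.array([10, 12, 100])
g, total_tokens = len(lengths), lengths.sum()

# Per-token weight under each scheme:
seq_token_w = 1.0 / (g * lengths)   # sequence aggregation: 1/(G*|o_i|)
tok_token_w = 1.0 / total_tokens    # token aggregation: uniform

# Total share of the group's signal per response:
print("sequence agg:", lengths * seq_token_w)
# -> [1/3, 1/3, 1/3]: equal per response, so each token of the 100-token
#    answer carries ~10x less weight than a token of the 10-token answer
print("token agg:   ", lengths / total_tokens)
# -> [~0.08, ~0.10, ~0.82]: the long response dominates the update, so if
#    long responses skew negative, length and sign become coupled
```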