Balanced Aggregation Fixes Bias in GRPO Reinforcement Learning
A new arXiv preprint identifies and corrects an aggregation bias in Group Relative Policy Optimization (GRPO), a widely used reinforcement learning method for large language models. The authors show that the two common ways of aggregating per-token losses each introduce a distinct bias: token-level aggregation couples a response's length with the sign of its advantage (sign-length coupling), while the standard sequence-level aggregation implicitly downweights tokens in longer responses. They propose Balanced Aggregation (BA), a drop-in replacement that computes token-level means separately within the positive- and negative-advantage subsets of a group and combines them with weights based on sequence counts. The method aims to improve training for reasoning and code-generation tasks.
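To make the three aggregation schemes concrete, here is a minimal NumPy sketch. The `balanced_agg` function follows the summary above (per-subset token means combined by sequence counts); the paper's exact formulation is not reproduced here, so treat it as an assumption rather than the authors' implementation.

```python
import numpy as np

def sequence_agg(token_losses):
    # Standard GRPO aggregation: mean over tokens within each sequence,
    # then mean over the G sequences. A token in a response of length
    # |o_i| gets weight 1/(G * |o_i|), so tokens in longer responses
    # count less.
    return float(np.mean([np.mean(l) for l in token_losses]))

def token_agg(token_losses):
    # Token-level aggregation: one flat mean over all tokens in the group.
    # Longer responses contribute more tokens, so a response's length is
    # coupled to how strongly its advantage sign shapes the update.
    return float(np.mean(np.concatenate(token_losses)))

def balanced_agg(token_losses, advantages):
    # Sketch of Balanced Aggregation (BA) as described in the summary:
    # token-level means computed separately within the positive- and
    # negative-advantage subsets, combined with weights proportional to
    # the number of sequences in each subset (an assumption; the paper
    # may weight differently).
    g = len(token_losses)
    total = 0.0
    for keep in (lambda a: a > 0, lambda a: a <= 0):
        subset = [l for l, a in zip(token_losses, advantages) if keep(a)]
        if subset:
            total += (len(subset) / g) * float(np.mean(np.concatenate(subset)))
    return total

# Toy group: per-token loss contributions for three sampled responses
# (illustrative numbers only, not from the paper).
losses = [np.full(10, -1.0), np.full(12, -1.0), np.full(100, 2.0)]
advs = [1.0, 1.0, -2.0]
print(sequence_agg(losses), token_agg(losses), balanced_agg(losses, advs))
```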
Key facts
- arXiv paper 2605.04077v1
- RLVR (reinforcement learning with verifiable rewards) is central to reasoning and code generation in LLMs
- GRPO-style training is widely adopted
- Sequence aggregation is standard in GRPO
- Token aggregation has been advocated as a better alternative
- Token aggregation introduces sign-length coupling
- Sequence aggregation implicitly downweights longer responses (both biases are illustrated numerically after this list)
- Balanced Aggregation (BA) is proposed as a drop-in replacement
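The two biases noted above can be seen with a toy per-token weight calculation; the lengths and advantages below are illustrative values, not figures from the paper:

```python
import numpy as np

# Illustrative group: two short positive-advantage responses and one long
# negative-advantage response (values made up for illustration).
lengths = np.array([10, 12, 100])
g, total_tokens = len(lengths), lengths.sum()

# Per-token weight under each scheme:
seq_token_w = 1.0 / (g * lengths)   # sequence aggregation: 1/(G*|o_i|)
tok_token_w = 1.0 / total_tokens    # token aggregation: uniform

# Total share of the group's signal per response:
print("sequence agg:", lengths * seq_token_w)
# -> [1/3, 1/3, 1/3]: equal per response, so each token of the 100-token
#    answer carries ~10x less weight than a token of the 10-token answer
print("token agg:   ", lengths / total_tokens)
# -> [~0.08, ~0.10, ~0.82]: the long response dominates the update, so if
#    long responses skew negative, length and sign become coupled
```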