ARTFEED — Contemporary Art Intelligence

Balanced Aggregation Fixes Bias in GRPO Reinforcement Learning

other · 2026-05-07

A new study on arXiv identifies and resolves an aggregation bias in Group Relative Policy Optimization (GRPO), a widely used reinforcement learning method for large language models. The authors show that the standard sequence-level aggregation and the alternative token-level aggregation each introduce a distinct bias: token aggregation couples the sign of a sequence's advantage to its length, while sequence aggregation implicitly downweights tokens in longer responses. They propose Balanced Aggregation (BA), a drop-in replacement that computes token-level means separately within the positive- and negative-advantage subsets of a group and combines them with sequence-count-based weighting, targeting improved training for reasoning and code generation.
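The summary above describes BA only at a high level, but the three aggregation schemes can be sketched to show where the biases come from. The following is a minimal illustration, not the paper's implementation: function names, the treatment of zero-advantage sequences, and the exact count-based weighting are assumptions.

```python
def token_agg(seq_terms):
    # Token aggregation: mean over all tokens pooled across the group.
    # Longer sequences contribute more tokens, so they dominate the sign.
    tokens = [t for seq in seq_terms for t in seq]
    return sum(tokens) / len(tokens)

def sequence_agg(seq_terms):
    # Sequence aggregation: mean of per-sequence token means.
    # Each sequence counts once, so tokens in long responses are downweighted.
    return sum(sum(seq) / len(seq) for seq in seq_terms) / len(seq_terms)

def balanced_agg(seq_terms, signs):
    # Balanced Aggregation (BA), as described in the summary: token-level
    # means computed separately within the positive- and negative-advantage
    # subsets, then combined with weights proportional to sequence counts.
    pos = [seq for seq, s in zip(seq_terms, signs) if s > 0]
    neg = [seq for seq, s in zip(seq_terms, signs) if s <= 0]

    def subset_token_mean(subset):
        tokens = [t for seq in subset for t in seq]
        return sum(tokens) / len(tokens) if tokens else 0.0

    n_pos, n_neg = len(pos), len(neg)
    return (n_pos * subset_token_mean(pos)
            + n_neg * subset_token_mean(neg)) / (n_pos + n_neg)

# Sign-length coupling under token aggregation: one short positive
# response vs. one long negative response (hypothetical numbers).
terms = [[1.0], [-0.2] * 9]   # per-token advantage terms for each response
signs = [+1, -1]
print(token_agg(terms))            # ≈ -0.08: the long negative response flips the sign
print(sequence_agg(terms))         # ≈ 0.4
print(balanced_agg(terms, signs))  # ≈ 0.4
```

In this toy group the pooled token mean goes negative purely because the negative response is longer, while sequence aggregation and BA agree on the positive value; BA additionally keeps a token-level mean inside each subset, which is where it avoids sequence aggregation's downweighting of long responses.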

Key facts

  • arXiv paper 2605.04077v1
  • RLVR (reinforcement learning with verifiable rewards) is central to reasoning and code generation in LLMs
  • GRPO-style training is widely adopted
  • Sequence aggregation is standard in GRPO
  • Token aggregation has been advocated as a better alternative
  • Token aggregation introduces sign-length coupling
  • Sequence aggregation implicitly downweights longer responses
  • Balanced Aggregation (BA) is proposed as a drop-in replacement

Entities

Institutions

  • arXiv

Sources