ARTFEED — Contemporary Art Intelligence

Covariance-Aware GRPO Improves LLM Reasoning by Taming Extreme Tokens

other · 2026-05-13

A new hyperparameter-free method, Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting, addresses the exploration-exploitation tradeoff in Group Relative Policy Optimization (GRPO) for large language models. The approach dynamically down-weights extreme token-level updates with a Gaussian kernel, motivated by the theoretical insight that entropy changes are governed by the covariance between token probabilities and advantages. Empirical evaluations show improved downstream performance on reasoning benchmarks compared to standard GRPO, with entropy remaining stable during training. The method is presented in a paper on arXiv (2605.11538) under Computer Science > Computation and Language.
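The core idea of down-weighting extreme token-level updates can be sketched as follows. This is an illustrative reconstruction, not the authors' exact formulation: it assumes the Gaussian kernel is centred on the group-mean advantage with bandwidth equal to the group standard deviation, which is one plausible way to keep the scheme hyperparameter-free.

```python
import math

def gaussian_kernel_reweight(advantages):
    """Down-weight extreme token-level advantages with a Gaussian kernel.

    Sketch only; assumes the kernel centre and bandwidth are derived
    from the group's own advantage statistics (mean and std), so no
    extra hyperparameter is introduced.
    """
    n = len(advantages)
    mean = sum(advantages) / n
    var = sum((a - mean) ** 2 for a in advantages) / n
    std = math.sqrt(var) or 1.0  # guard against zero variance
    # Kernel weight -> 1 for advantages near the mean, -> 0 for outliers
    weights = [math.exp(-((a - mean) ** 2) / (2 * std ** 2))
               for a in advantages]
    # Reweighted advantages would then enter the GRPO policy-gradient update
    return [w * a for w, a in zip(weights, advantages)]
```

An extreme advantage is thus shrunk more aggressively than one close to the group mean, which is the mechanism the paper credits with taming entropy collapse.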

Key facts

  • Method is called Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting
  • It is hyperparameter-free
  • Down-weights extreme token-level updates via Gaussian kernel
  • Motivated by covariance between token probabilities and advantages
  • Improves reasoning benchmarks over standard GRPO
  • Stabilizes entropy during training
  • Published on arXiv with ID 2605.11538
  • Filed under Computer Science > Computation and Language

Entities

Institutions

  • arXiv
