ARTFEED — Contemporary Art Intelligence

Covariance-Aware GRPO Improves LLM Reasoning by Taming Extreme Tokens

other · 2026-05-13

A new hyperparameter-free method, Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting, addresses the exploration-exploitation tradeoff in Group Relative Policy Optimization (GRPO) for large language models. The approach dynamically down-weights extreme token-level updates with a Gaussian kernel, motivated by the theoretical insight that entropy changes are governed by the covariance between token probabilities and advantages. Empirical evaluations show improved downstream performance on reasoning benchmarks compared to standard GRPO, with entropy remaining stable during training. The method is presented in a paper on arXiv (2605.11538) under Computer Science > Computation and Language.
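The core idea of down-weighting extreme token-level updates can be sketched as follows. This is an illustrative reconstruction, not the authors' exact formulation: it assumes the Gaussian kernel is centred on the group-mean advantage with bandwidth equal to the group standard deviation, which is one plausible way to keep the scheme hyperparameter-free.

```python
import math

def gaussian_kernel_reweight(advantages):
    """Down-weight extreme token-level advantages with a Gaussian kernel.

    Sketch only; assumes the kernel centre and bandwidth are derived
    from the group's own advantage statistics (mean and std), so no
    extra hyperparameter is introduced.
    """
    n = len(advantages)
    mean = sum(advantages) / n
    var = sum((a - mean) ** 2 for a in advantages) / n
    std = math.sqrt(var) or 1.0  # guard against zero variance
    # Kernel weight -> 1 for advantages near the mean, -> 0 for outliers
    weights = [math.exp(-((a - mean) ** 2) / (2 * std ** 2))
               for a in advantages]
    # Reweighted advantages would then enter the GRPO policy-gradient update
    return [w * a for w, a in zip(weights, advantages)]
```

An extreme advantage is thus shrunk more aggressively than one close to the group mean, which is the mechanism the paper credits with taming entropy collapse.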

Key facts

  • Method is called Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting
  • It is hyperparameter-free
  • Down-weights extreme token-level updates via Gaussian kernel
  • Motivated by covariance between token probabilities and advantages
  • Improves reasoning benchmarks over standard GRPO
  • Stabilizes entropy during training
  • Published on arXiv with ID 2605.11538
  • Filed under Computer Science > Computation and Language

Entities

Institutions

  • arXiv
