On-Policy Entropy Flow Optimization Prevents Entropy Collapse in RLVR
A new paper on arXiv (2605.11491) diagnoses entropy collapse in reinforcement learning with verifiable rewards (RLVR) for large language models as a token-level entropy flow imbalance: entropy-decreasing tokens consistently outnumber entropy-increasing ones, so policy entropy drifts downward over training. This token-level view yields a unified explanation for collapse in algorithms such as GRPO, and the authors propose On-Policy Entropy Flow Optimization (OP) as a remedy that improves on coarse-grained entropy regularization and ratio-based clipping heuristics.
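To make the flow-imbalance diagnosis concrete, here is a minimal PyTorch sketch (not the paper's code; `token_entropy` and `entropy_flow_balance` are hypothetical names) that measures the share of tokens whose policy entropy drops across a single update. A share well above 0.5 is the kind of imbalance the paper associates with collapse.

```python
# A minimal sketch (not the paper's implementation) of a token-level
# entropy-flow diagnostic: compare per-token policy entropy before and
# after one update and count entropy-decreasing vs. -increasing tokens.
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-token entropy of the categorical distribution over the vocab.

    logits: (batch, seq_len, vocab) -> returns (batch, seq_len).
    """
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)

def entropy_flow_balance(logits_before: torch.Tensor,
                         logits_after: torch.Tensor) -> float:
    """Fraction of tokens whose entropy dropped across one policy update."""
    delta = token_entropy(logits_after) - token_entropy(logits_before)
    return (delta < 0).float().mean().item()

if __name__ == "__main__":
    torch.manual_seed(0)
    before = torch.randn(4, 16, 32000)  # pre-update logits (toy data)
    after = before * 1.2                # sharpened logits, a toy stand-in
                                        # for an entropy-reducing update
    print(f"share of entropy-decreasing tokens: "
          f"{entropy_flow_balance(before, after):.2f}")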
Key facts
- arXiv paper 2605.11491
- RLVR algorithms like GRPO suffer from entropy collapse
- Entropy collapse leads to premature determinism and unstable optimization
- Existing remedies include entropy regularization and ratio-based clipping heuristics (see the loss sketch after this list)
- Paper revisits entropy collapse from token-level entropy flow perspective
- Entropy-decreasing tokens consistently outnumber entropy-increasing ones, making the net entropy flow negative
- Proposes On-Policy Entropy Flow Optimization (OP)
- Provides unified explanation of entropy collapse in existing RLVR algorithms
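For context, the coarse-grained remedies the paper improves upon look roughly like the following PPO/GRPO-style loss. This is a standard sketch, not the paper's OP objective, whose exact form the summary above does not specify.

```python
# A sketch of the two conventional remedies (standard PPO/GRPO-style
# machinery, not the paper's OP method): a global entropy bonus and
# ratio-based clipping of the policy update.
import torch

def clipped_surrogate_with_entropy_bonus(
    logp_new: torch.Tensor,    # (batch, seq) log-probs, current policy
    logp_old: torch.Tensor,    # (batch, seq) log-probs, behavior policy
    advantages: torch.Tensor,  # (batch, seq) advantage estimates
    entropy: torch.Tensor,     # (batch, seq) per-token policy entropy
    clip_eps: float = 0.2,     # PPO clipping range
    ent_coef: float = 0.01,    # weight of the coarse-grained entropy bonus
) -> torch.Tensor:
    ratio = (logp_new - logp_old).exp()
    unclipped = ratio * advantages
    clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps) * advantages
    # Clipping caps how far any single token can move the policy;
    # the entropy bonus pushes back uniformly against determinism.
    policy_loss = -torch.minimum(unclipped, clipped).mean()
    return policy_loss - ent_coef * entropy.mean()
```

Both terms act uniformly over tokens, which is what makes them coarse-grained: neither distinguishes entropy-decreasing tokens from entropy-increasing ones, the distinction the token-level entropy-flow view makes central.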