On-Policy Entropy Flow Optimization Prevents Entropy Collapse in RLVR
A new paper on arXiv (2605.11491) diagnoses entropy collapse in reinforcement learning with verifiable rewards (RLVR) for large language models as a token-level entropy flow imbalance: entropy-decreasing tokens consistently outnumber entropy-increasing ones, so policy entropy drifts downward over training. This token-level view yields a unified explanation for collapse in algorithms such as GRPO, and the authors propose On-Policy Entropy Flow Optimization (OP) as a remedy that improves on coarse-grained entropy regularization and ratio-based clipping heuristics.
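To make the flow-imbalance diagnosis concrete, here is a minimal PyTorch sketch (not the paper's code; `token_entropy` and `entropy_flow_balance` are hypothetical names) that measures the share of tokens whose policy entropy drops across a single update. A share well above 0.5 is the kind of imbalance the paper associates with collapse.

```python
# A minimal sketch (not the paper's implementation) of a token-level
# entropy-flow diagnostic: compare per-token policy entropy before and
# after one update and count entropy-decreasing vs. -increasing tokens.
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-token entropy of the categorical distribution over the vocab.

    logits: (batch, seq_len, vocab) -> returns (batch, seq_len).
    """
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)

def entropy_flow_balance(logits_before: torch.Tensor,
                         logits_after: torch.Tensor) -> float:
    """Fraction of tokens whose entropy dropped across one policy update."""
    delta = token_entropy(logits_after) - token_entropy(logits_before)
    return (delta < 0).float().mean().item()

if __name__ == "__main__":
    torch.manual_seed(0)
    before = torch.randn(4, 16, 32000)  # pre-update logits (toy data)
    after = before * 1.2                # sharpened logits, a toy stand-in
                                        # for an entropy-reducing update
    print(f"share of entropy-decreasing tokens: "
          f"{entropy_flow_balance(before, after):.2f}")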
Key facts
- arXiv paper 2605.11491
- RLVR algorithms like GRPO suffer from entropy collapse
- Entropy collapse leads to premature determinism and unstable optimization
- Existing remedies include entropy regularization and ratio-based clipping heuristics (see the loss sketch after this list)
- Paper revisits entropy collapse from token-level entropy flow perspective
- Entropy-decreasing tokens consistently outnumber entropy-increasing ones, making the net entropy flow negative
- Proposes On-Policy Entropy Flow Optimization (OP)
- Provides unified explanation of entropy collapse in existing RLVR algorithms
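For context, the coarse-grained remedies the paper improves upon look roughly like the following PPO/GRPO-style loss. This is a standard sketch, not the paper's OP objective, whose exact form the summary above does not specify.

```python
# A sketch of the two conventional remedies (standard PPO/GRPO-style
# machinery, not the paper's OP method): a global entropy bonus and
# ratio-based clipping of the policy update.
import torch

def clipped_surrogate_with_entropy_bonus(
    logp_new: torch.Tensor,    # (batch, seq) log-probs, current policy
    logp_old: torch.Tensor,    # (batch, seq) log-probs, behavior policy
    advantages: torch.Tensor,  # (batch, seq) advantage estimates
    entropy: torch.Tensor,     # (batch, seq) per-token policy entropy
    clip_eps: float = 0.2,     # PPO clipping range
    ent_coef: float = 0.01,    # weight of the coarse-grained entropy bonus
) -> torch.Tensor:
    ratio = (logp_new - logp_old).exp()
    unclipped = ratio * advantages
    clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps) * advantages
    # Clipping caps how far any single token can move the policy;
    # the entropy bonus pushes back uniformly against determinism.
    policy_loss = -torch.minimum(unclipped, clipped).mean()
    return policy_loss - ent_coef * entropy.mean()
```

Both terms act uniformly over tokens, which is what makes them coarse-grained: neither distinguishes entropy-decreasing tokens from entropy-increasing ones, the distinction the token-level entropy-flow view makes central.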