ARTFEED — Contemporary Art Intelligence

On-Policy Entropy Flow Optimization Prevents Entropy Collapse in RLVR

publication · 2026-05-13

A new paper on arXiv (2605.11491) traces entropy collapse in reinforcement learning with verifiable rewards (RLVR) for large language models to a token-level entropy flow imbalance: during training, entropy-decreasing tokens consistently outnumber entropy-increasing ones. The authors propose On-Policy Entropy Flow Optimization (OP) to correct the imbalance, giving a unified explanation for collapse in algorithms such as GRPO and improving on coarse-grained remedies like entropy regularization and ratio-based clipping heuristics.
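
One plausible formalization of "token-level entropy flow" (the paper's exact definitions may differ, so treat this as an illustrative reading): write π_θ for the policy, s_t for the context at position t, and V for the vocabulary. The per-token entropy and the net entropy flow of one update θ → θ' are then

```latex
\[
H_t(\theta) = -\sum_{v \in \mathcal{V}} \pi_\theta(v \mid s_t)\,\log \pi_\theta(v \mid s_t),
\qquad
\Delta H = \sum_t \bigl[\, H_t(\theta') - H_t(\theta) \,\bigr].
\]
```

On this reading, entropy collapse is the regime where the negative terms dominate the sum at every update.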

Key facts

  • arXiv paper 2605.11491
  • RLVR algorithms like GRPO suffer from entropy collapse
  • Entropy collapse leads to premature determinism and unstable optimization
  • Existing remedies include entropy regularization and ratio-based clipping heuristics
  • Paper revisits entropy collapse from token-level entropy flow perspective
  • Entropy-decreasing tokens consistently outnumber entropy-increasing ones (see the measurement sketch after this list)
  • Proposes On-Policy Entropy Flow Optimization (OP)
  • Provides unified explanation of entropy collapse in existing RLVR algorithms
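
A minimal sketch of how one might measure the imbalance empirically; the function names and the exact notion of "flow" here are assumptions, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def token_entropies(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of the next-token distribution at each position.

    logits: [seq_len, vocab_size] -> entropies: [seq_len].
    """
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)

def entropy_flow(logits_before: torch.Tensor, logits_after: torch.Tensor) -> dict:
    """Classify positions by the sign of their entropy change across one update.

    A persistent surplus of "decreasing" over "increasing" is the token-level
    imbalance the paper identifies as the driver of entropy collapse.
    """
    delta = token_entropies(logits_after) - token_entropies(logits_before)
    return {
        "decreasing": int((delta < 0).sum()),
        "increasing": int((delta > 0).sum()),
        "net_flow": float(delta.sum()),
    }
```

Running this on the same prompts before and after a GRPO-style update would, on the paper's account, show "decreasing" consistently exceeding "increasing".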

Entities

Institutions

  • arXiv
