EP-GRPO: Entropy-Progress Aligned Reinforcement Learning for LLMs
Entropy-Progress Aligned Group Relative Policy Optimization (EP-GRPO) is a reinforcement learning framework introduced to address credit assignment failures in existing methods such as GRPO. The preprint (arXiv 2605.04960) identifies three such failures: uniform token-level granularity, which ignores the highly non-uniform informational value of tokens; uniform polarity, which penalizes correct steps in failed trajectories and rewards erroneous steps in successful ones; and zero-variance collapse, which extinguishes outcome-driven gradients when every response in a group earns the same reward. EP-GRPO counters these failures with entropy-gated modulation, which concentrates credit on high-entropy decision points, and with implicit process signals derived from policy divergence aligned with outcome advantage. The paper systematically quantifies the three failure modes, documenting large disparities in token informativeness, pervasive step-level polarity misalignment, and substantial training inefficiency. The result is denser, self-supervised guidance for LLM reasoning, mined from the model's intrinsic information flow.
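The gating mechanism can be illustrated with a short sketch. The following PyTorch code is a hypothetical rendering, not the paper's exact formulation: the function name `entropy_gated_advantages`, the standardization constants, and the normalization of the gate are assumptions. It shows how per-token predictive entropy could modulate a GRPO-style group-relative outcome advantage so that high-entropy decision points absorb more of the credit.

```python
# Hypothetical sketch of entropy-gated advantage modulation (assumed form,
# not the paper's exact method): per-token entropy gates how much of the
# group-relative outcome advantage each token receives.
import torch
import torch.nn.functional as F

def entropy_gated_advantages(logits, rewards, temperature=1.0):
    """logits: (G, T, V) policy logits for a group of G completions of length T.
    rewards: (G,) scalar outcome rewards, one per completion."""
    # GRPO-style group-relative advantage: standardize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)    # (G,)

    # Per-token predictive entropy of the policy at each decision point.
    log_probs = F.log_softmax(logits / temperature, dim=-1)      # (G, T, V)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)         # (G, T)

    # Gate: scale each token's credit by its entropy relative to the
    # sequence mean, so uncertain "fork" tokens get more of the signal.
    gate = entropy / (entropy.mean(dim=-1, keepdim=True) + 1e-6) # (G, T)
    return gate * adv.unsqueeze(-1)                              # (G, T)
```

Note that in this sketch the gate averages to one within each sequence, so the total outcome credit is preserved and merely redistributed toward high-entropy tokens.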
Key facts
- EP-GRPO is proposed to address credit assignment failures in GRPO.
- Three failures: uniform token granularity, uniform polarity, zero-variance collapse (see the numerical check after this list).
- EP-GRPO uses entropy-gated modulation and implicit process signals.
- The framework mines the model's intrinsic information flow for guidance.
- Preprint available on arXiv with ID 2605.04960.
- Reinforcement learning with verifiable rewards (RLVR) is the broader context.
- The approach aims to improve LLM reasoning.
- Systematic quantification of failures shows non-uniform token informativeness.
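To see why zero-variance collapse starves training of signal, here is a minimal numerical check under an assumed GRPO-style group standardization (illustrative code, not taken from the paper):

```python
# Illustration: when every completion in a group gets the same verifiable
# reward, the group-standardized advantage is zero for all of them, so the
# outcome-based gradient from this group vanishes.
import torch

rewards = torch.tensor([1.0, 1.0, 1.0, 1.0])  # e.g., an all-correct group
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
print(adv)  # tensor([0., 0., 0., 0.]) -> no learning signal from this group
```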
Entities
Institutions
- arXiv