FG-ExPO: New RL Method Improves LLM Math Reasoning
Researchers propose FG-ExPO (Frontier-Guided Exploration-Prioritized Policy Optimization), a new reinforcement learning method for improving LLM mathematical reasoning. It addresses two inefficiencies in Group Relative Policy Optimization (GRPO): a fixed KL coefficient that overly restricts exploration, and uniform question sampling that ignores the value of moderately difficult problems. FG-ExPO combines Accuracy-Conditioned KL Scaling (AKL), which adjusts the KL penalty based on batch average accuracy, with a Gaussian Curriculum that prioritizes the most informative training examples. The method targets Reinforcement Learning with Verifiable Rewards (RLVR), the standard paradigm for training LLMs on math reasoning. The paper is available on arXiv.
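The summary does not give AKL's exact schedule, but the idea of conditioning the KL penalty on batch accuracy can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `akl_coefficient`, the linear schedule, and the `base_beta` and `floor` defaults are all assumptions.

```python
def akl_coefficient(batch_accuracy: float, base_beta: float = 0.04,
                    floor: float = 0.0) -> float:
    """Hypothetical Accuracy-Conditioned KL Scaling (AKL) schedule.

    Assumed behavior: when the batch average accuracy is low, the KL
    coefficient is reduced so the policy can explore further from the
    reference model; as accuracy rises, the penalty grows back toward
    base_beta to stabilize training. The paper may use a different
    functional form.
    """
    if not 0.0 <= batch_accuracy <= 1.0:
        raise ValueError("batch_accuracy must be in [0, 1]")
    # Linear interpolation between floor (full exploration) and base_beta.
    return floor + (base_beta - floor) * batch_accuracy
```

The resulting coefficient would then scale the KL term in the GRPO objective for that batch, replacing the fixed coefficient the summary identifies as a bottleneck.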
Key facts
- FG-ExPO stands for Frontier-Guided Exploration-Prioritized Policy Optimization
- It addresses two inefficiencies in GRPO: fixed KL coefficient and uniform question sampling
- Accuracy-Conditioned KL Scaling (AKL) adjusts KL penalty based on batch accuracy
- Gaussian Curriculum prioritizes moderately difficult problems
- RLVR is the standard paradigm for LLM mathematical reasoning
- GRPO is the dominant algorithm for RLVR
- The paper is on arXiv with ID 2605.11403
- The method is designed for LLM reasoning tasks