FG-ExPO: New RL Method Improves LLM Math Reasoning
Researchers propose FG-ExPO (Frontier-Guided Exploration-Prioritized Policy Optimization), a new reinforcement learning method for improving LLM mathematical reasoning. It addresses two inefficiencies in Group Relative Policy Optimization (GRPO): a fixed KL coefficient that overly restricts exploration, and uniform question sampling that ignores the value of moderately difficult problems. FG-ExPO combines Accuracy-Conditioned KL Scaling (AKL), which adjusts the KL penalty based on batch average accuracy, with a Gaussian Curriculum that prioritizes the most informative training examples. The method targets Reinforcement Learning with Verifiable Rewards (RLVR), the standard paradigm for training LLMs on math reasoning. The paper is available on arXiv.
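The summary does not give AKL's exact schedule, but the idea of conditioning the KL penalty on batch accuracy can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `akl_coefficient`, the linear schedule, and the `base_beta` and `floor` defaults are all assumptions.

```python
def akl_coefficient(batch_accuracy: float, base_beta: float = 0.04,
                    floor: float = 0.0) -> float:
    """Hypothetical Accuracy-Conditioned KL Scaling (AKL) schedule.

    Assumed behavior: when the batch average accuracy is low, the KL
    coefficient is reduced so the policy can explore further from the
    reference model; as accuracy rises, the penalty grows back toward
    base_beta to stabilize training. The paper may use a different
    functional form.
    """
    if not 0.0 <= batch_accuracy <= 1.0:
        raise ValueError("batch_accuracy must be in [0, 1]")
    # Linear interpolation between floor (full exploration) and base_beta.
    return floor + (base_beta - floor) * batch_accuracy
```

The resulting coefficient would then scale the KL term in the GRPO objective for that batch, replacing the fixed coefficient the summary identifies as a bottleneck.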
Key facts
- FG-ExPO stands for Frontier-Guided Exploration-Prioritized Policy Optimization
- It addresses two inefficiencies in GRPO: fixed KL coefficient and uniform question sampling
- Accuracy-Conditioned KL Scaling (AKL) adjusts KL penalty based on batch accuracy
- Gaussian Curriculum prioritizes moderately difficult problems
- RLVR is the standard paradigm for LLM mathematical reasoning
- GRPO is the dominant algorithm for RLVR
- The paper is on arXiv with ID 2605.11403
- The method is designed for LLM reasoning tasks