ARTFEED — Contemporary Art Intelligence

EXPO: New RLVR Algorithm Improves LLM Math Reasoning

ai-technology · 2026-05-12

A new paper on arXiv (2605.09923) introduces Exploration-Prioritized Policy Optimization (EXPO), a method for improving reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs) on mathematical reasoning. The authors identify two inefficiencies in the standard Group Relative Policy Optimization (GRPO) algorithm: a fixed KL penalty coefficient that limits policy exploration, and uniform sampling of training questions, which fails to prioritize moderately difficult problems. EXPO addresses these with two lightweight plug-in modules: Accuracy-Conditioned KL Scaling (AKL), which dynamically adjusts the strength of KL regularization based on batch accuracy, and Gaussian Curriculum Sampling, which concentrates training on questions of moderate difficulty. Together, the modules aim to make RLVR training for LLMs more efficient and effective.
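The summary does not give the paper's exact AKL scaling rule, so the following is only a minimal sketch of one plausible interpretation: the KL coefficient shrinks when batch accuracy is low (weaker anchoring to the reference policy, more exploration) and grows toward a base value as accuracy improves. The function name, base coefficient, and floor value are all illustrative assumptions, not the authors' settings.

```python
def akl_coefficient(batch_accuracy: float,
                    beta_base: float = 0.04,
                    floor: float = 0.001) -> float:
    """Hypothetical accuracy-conditioned KL coefficient.

    batch_accuracy: fraction of sampled answers verified correct in the batch.
    Low accuracy -> coefficient near the floor -> weaker KL penalty,
    letting the policy explore further from the reference model.
    High accuracy -> coefficient near beta_base -> tighter regularization.
    """
    return max(floor, beta_base * batch_accuracy)
```

Any monotone schedule with the same shape (e.g. a sigmoid in accuracy) would fit the description equally well; the key design choice is conditioning the penalty on a verifiable batch-level signal rather than fixing it for the whole run.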

Key facts

  • Paper published on arXiv with ID 2605.09923
  • Proposes Exploration-Prioritized Policy Optimization (EXPO)
  • Addresses inefficiencies in Group Relative Policy Optimization (GRPO)
  • Introduces Accuracy-Conditioned KL Scaling (AKL) module
  • Introduces Gaussian Curriculum Sampling module
  • Focuses on mathematical reasoning for LLMs
  • RLVR stands for Reinforcement Learning with Verifiable Rewards
  • Paper appears as a new announcement on arXiv
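Gaussian Curriculum Sampling, as described above, weights training questions toward moderate difficulty rather than sampling them uniformly. A minimal sketch, assuming difficulty is measured by each question's observed pass rate and that the curriculum centers on a pass rate of 0.5 (the mean, spread, and function names here are illustrative assumptions, not the paper's parameters):

```python
import math
import random

def gaussian_weights(pass_rates, mu=0.5, sigma=0.2):
    """Weight each question by a Gaussian centered on moderate difficulty.

    Questions the model solves about half the time (pass rate near mu)
    get weight ~1; very easy or very hard questions get weights near 0.
    """
    return [math.exp(-((p - mu) ** 2) / (2 * sigma ** 2)) for p in pass_rates]

def sample_batch(questions, pass_rates, k, seed=0):
    """Draw a training batch, favoring moderately difficult questions."""
    rng = random.Random(seed)
    return rng.choices(questions, weights=gaussian_weights(pass_rates), k=k)
```

Compared with uniform sampling, this concentrates rollouts on questions whose reward signal is informative: near-impossible and near-trivial questions yield mostly uniform (all-fail or all-pass) groups, which give GRPO little gradient to work with.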

Entities

Institutions

  • arXiv

Sources