ARTFEED — Contemporary Art Intelligence

EXPO: New RLVR Algorithm Improves LLM Math Reasoning

ai-technology · 2026-05-12

A new paper on arXiv (2605.09923) introduces Exploration-Prioritized Policy Optimization (EXPO), a method for improving reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs) on mathematical reasoning. The authors identify two inefficiencies in the standard Group Relative Policy Optimization (GRPO) algorithm: a fixed KL penalty coefficient that limits policy exploration, and uniform sampling of training questions, which fails to prioritize moderately difficult problems. EXPO addresses these with two lightweight plug-in modules: Accuracy-Conditioned KL Scaling (AKL), which dynamically adjusts the strength of KL regularization based on batch accuracy, and Gaussian Curriculum Sampling, which concentrates training on questions of moderate difficulty. Together, the modules aim to make RLVR training for LLMs more efficient and effective.
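The summary does not give the paper's exact AKL scaling rule, so the following is only a minimal sketch of one plausible interpretation: the KL coefficient shrinks when batch accuracy is low (weaker anchoring to the reference policy, more exploration) and grows toward a base value as accuracy improves. The function name, base coefficient, and floor value are all illustrative assumptions, not the authors' settings.

```python
def akl_coefficient(batch_accuracy: float,
                    beta_base: float = 0.04,
                    floor: float = 0.001) -> float:
    """Hypothetical accuracy-conditioned KL coefficient.

    batch_accuracy: fraction of sampled answers verified correct in the batch.
    Low accuracy -> coefficient near the floor -> weaker KL penalty,
    letting the policy explore further from the reference model.
    High accuracy -> coefficient near beta_base -> tighter regularization.
    """
    return max(floor, beta_base * batch_accuracy)
```

Any monotone schedule with the same shape (e.g. a sigmoid in accuracy) would fit the description equally well; the key design choice is conditioning the penalty on a verifiable batch-level signal rather than fixing it for the whole run.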

Key facts

  • Paper published on arXiv with ID 2605.09923
  • Proposes Exploration-Prioritized Policy Optimization (EXPO)
  • Addresses inefficiencies in Group Relative Policy Optimization (GRPO)
  • Introduces Accuracy-Conditioned KL Scaling (AKL) module
  • Introduces Gaussian Curriculum Sampling module
  • Focuses on mathematical reasoning for LLMs
  • RLVR stands for Reinforcement Learning with Verifiable Rewards
  • Paper appears as a new announcement on arXiv
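Gaussian Curriculum Sampling, as described above, weights training questions toward moderate difficulty rather than sampling them uniformly. A minimal sketch, assuming difficulty is measured by each question's observed pass rate and that the curriculum centers on a pass rate of 0.5 (the mean, spread, and function names here are illustrative assumptions, not the paper's parameters):

```python
import math
import random

def gaussian_weights(pass_rates, mu=0.5, sigma=0.2):
    """Weight each question by a Gaussian centered on moderate difficulty.

    Questions the model solves about half the time (pass rate near mu)
    get weight ~1; very easy or very hard questions get weights near 0.
    """
    return [math.exp(-((p - mu) ** 2) / (2 * sigma ** 2)) for p in pass_rates]

def sample_batch(questions, pass_rates, k, seed=0):
    """Draw a training batch, favoring moderately difficult questions."""
    rng = random.Random(seed)
    return rng.choices(questions, weights=gaussian_weights(pass_rates), k=k)
```

Compared with uniform sampling, this concentrates rollouts on questions whose reward signal is informative: near-impossible and near-trivial questions yield mostly uniform (all-fail or all-pass) groups, which give GRPO little gradient to work with.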

Entities

Institutions

  • arXiv

Sources