ARTFEED — Contemporary Art Intelligence

AGPO: Adaptive Group Policy Optimization Improves LLM Reasoning

ai-technology · 2026-05-22

Adaptive Group Policy Optimization (AGPO) is a reinforcement learning method that improves the reasoning of large language models (LLMs) by using group-level statistics to control both update magnitude and exploration. A critic-free refinement of GRPO, it derives a shared statistical state from probes and uses that state to drive adaptive clipping and bidirectional adaptive temperature sampling. Across nine English and Chinese math and STEM benchmarks, Qwen2.5-14B trained with AGPO outperformed PPO and GRPO under the same generated-token budget, scoring 67.3% on GSM8K and 40.5% on MATH. The gains also transfer to Llama-3-8B.
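
For readers unfamiliar with the GRPO baseline that AGPO refines, the sketch below shows what "critic-free" means in practice: each sampled answer is scored against the mean and standard deviation of its own group, so no learned value network is needed. This is a minimal illustration of the standard group-normalized advantage and a fixed-range clipped surrogate, not AGPO itself; the function names and the example rewards are assumptions made for illustration.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each reward by its group's mean and std.

    No critic is involved; the group itself serves as the baseline.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO/GRPO clipped objective for one sample, with a fixed clip range."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# Hypothetical group of four sampled answers with binary correctness rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

In this baseline `clip_eps` is a constant; per the summary above, AGPO's change is to set that trust-region size adaptively from group statistics.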

Key facts

  • AGPO is a critic-free refinement of GRPO
  • Uses group-level statistics to control update magnitude and exploration
  • Adaptive clipping sets trust-region size from reward dispersion, skewness, probe vote entropy, policy entropy, and step-wise KL drift
  • Bidirectional adaptive temperature sampling heats or cools decoding around a base temperature
  • Tested on nine English and Chinese math/STEM benchmarks
  • Qwen2.5-14B with AGPO achieves 67.3% on GSM8K and 40.5% on MATH
  • Outperforms PPO/GRPO under same generated-token budget
  • Gains transfer to Llama-3-8B
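
The two adaptive mechanisms listed above can be pictured with a toy sketch. The summary does not give AGPO's actual formulas, so everything below is an assumption: it uses only two of the named signals (reward dispersion and probe vote entropy), combines them with a made-up equal weighting, and picks an arbitrary direction for each adjustment. All function names, constants, and the linear rules are hypothetical and chosen only to show the shape of the idea.

```python
import math
import statistics

def vote_entropy(answers):
    """Normalized entropy of the answer distribution in a probe group, in [0, 1]."""
    n = len(answers)
    if n <= 1:
        return 0.0
    counts = {}
    for a in answers:
        counts[a] = counts.get(a, 0) + 1
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return max(0.0, h / math.log(n))

def group_signal(rewards, answers):
    """Hypothetical shared state in [0, 1]; assumes rewards lie in [0, 1]."""
    dispersion = min(1.0, 2.0 * statistics.pstdev(rewards))  # pstdev <= 0.5 here
    return 0.5 * dispersion + 0.5 * vote_entropy(answers)

def adaptive_clip_eps(signal, min_eps=0.05, max_eps=0.4):
    """Assumed rule: a more informative (disagreeing) group gets a wider clip range."""
    return min_eps + (max_eps - min_eps) * signal

def adaptive_temperature(signal, base_temp=1.0, gain=0.5, target=0.5,
                         t_min=0.5, t_max=1.5):
    """Assumed bidirectional rule: heat above base_temp when the group is too
    uniform (signal < target), cool below it when the group is too scattered."""
    temp = base_temp * (1.0 + gain * (target - signal))
    return min(max(temp, t_min), t_max)

# Hypothetical probe groups: one fully split, one unanimous.
split = group_signal([1.0, 0.0, 1.0, 0.0], ["a", "b", "c", "d"])
unanimous = group_signal([1.0, 1.0, 1.0, 1.0], ["a", "a", "a", "a"])
```

The point of the sketch is the shared state: both the clip range and the sampling temperature read the same `signal`, which mirrors the description of a single statistical state driving both mechanisms. The real method also draws on skewness, policy entropy, and step-wise KL drift, which are omitted here.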

Entities

Institutions

  • arXiv
