AGPO: Adaptive Group Policy Optimization Improves LLM Reasoning

ai-technology · 2026-05-22

Adaptive Group Policy Optimization (AGPO) is a reinforcement learning technique that improves the reasoning of large language models (LLMs) by using group-level statistics to control update magnitude and exploration. A critic-free refinement of GRPO, it derives a shared statistical state from probes and uses that state to drive adaptive clipping and bidirectional adaptive temperature sampling. Across nine English and Chinese math and STEM benchmarks, Qwen2.5-14B trained with AGPO outperformed PPO and GRPO under the same generated-token budget, scoring 67.3% on GSM8K and 40.5% on MATH. The gains also transfer to Llama-3-8B.
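For context, GRPO can drop the critic because it replaces a learned value function with a baseline computed from a group of completions sampled for the same prompt. The sketch below shows that standard group-relative normalization, not anything specific to AGPO. The function name, the `eps` guard, and the use of the population standard deviation are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group (GRPO-style baseline).

    `rewards` holds one scalar reward per completion sampled for a single
    prompt. `eps` is an assumed guard against a zero standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some setups use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four completions for one prompt, scored 1.0 if correct and 0.0 otherwise.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Correct completions get a positive advantage and incorrect ones a negative advantage. A group with identical rewards yields all zeros, so it contributes no gradient signal.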

Key facts

AGPO is a critic-free refinement of GRPO
Uses group-level statistics to control update magnitude and exploration
Adaptive clipping sets trust-region size from reward dispersion, skewness, probe vote entropy, policy entropy, and step-wise KL drift
Bidirectional adaptive temperature sampling heats or cools decoding around a base temperature
Tested on nine English and Chinese math/STEM benchmarks
Qwen2.5-14B with AGPO achieves 67.3% on GSM8K and 40.5% on MATH
Outperforms PPO/GRPO under same generated-token budget
Gains transfer to Llama-3-8B
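The summary does not give AGPO's formulas, so the following is only a hedged sketch of how group statistics could drive both a clip range and a decoding temperature. Everything here is an assumption made for illustration: the function names, the constants, the direction of each adjustment, and the use of just two signals (reward dispersion and probe vote entropy). Skewness, policy entropy, and KL drift are omitted.

```python
import math
from statistics import pstdev


def vote_entropy(answers):
    """Shannon entropy of the probe answer votes, normalized to [0, 1]."""
    n = len(answers)
    counts = {}
    for a in answers:
        counts[a] = counts.get(a, 0) + 1
    if len(counts) <= 1:
        return 0.0  # all probes agree (or there is only one probe)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(n)  # log(n) is the maximum, reached when all votes differ


def adaptive_clip(rewards, answers, base_eps=0.2, eps_min=0.05, eps_max=0.4):
    """Hypothetical rule: noisier groups get a tighter trust region."""
    dispersion = pstdev(rewards)
    h = vote_entropy(answers)
    # Assumed form, not taken from the source.
    eps = base_eps * (1.0 - 0.5 * h) / (1.0 + dispersion)
    return min(eps_max, max(eps_min, eps))


def adaptive_temperature(answers, base_temp=1.0, target_entropy=0.5,
                         gain=0.5, t_min=0.5, t_max=1.5):
    """Hypothetical bidirectional rule around `base_temp`.

    Heats decoding when probes agree more than the target (too little
    exploration) and cools it when they disagree more than the target.
    """
    h = vote_entropy(answers)
    temp = base_temp * (1.0 + gain * (target_entropy - h))
    return min(t_max, max(t_min, temp))


if __name__ == "__main__":
    agree = ["42", "42", "42", "42"]
    split = ["41", "42", "43", "44"]
    print(adaptive_clip([1.0, 1.0, 1.0, 1.0], agree), adaptive_temperature(agree))
    print(adaptive_clip([1.0, 0.0, 1.0, 0.0], split), adaptive_temperature(split))
```

Both controllers read the same statistics, which mirrors the shared statistical state described above. A real implementation would need the paper's actual functional forms and coefficients.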
