AGPO: Adaptive Group Policy Optimization Improves LLM Reasoning
Adaptive Group Policy Optimization (AGPO) is a reinforcement learning technique that improves the reasoning of large language models (LLMs) by using group-level statistics to control update magnitude and exploration. A critic-free refinement of GRPO, it derives a shared statistical state from probes and uses that state to drive two mechanisms: adaptive clipping and bidirectional adaptive temperature sampling. On nine English and Chinese math and STEM benchmarks, Qwen2.5-14B trained with AGPO outperformed PPO and GRPO baselines under the same generated-token budget, scoring 67.3% on GSM8K and 40.5% on MATH. The gains also transfer to Llama-3-8B.
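For readers unfamiliar with the GRPO baseline that AGPO refines, the sketch below shows the standard critic-free idea: each sampled answer's advantage is its reward normalized against the other answers in the same group, and the policy update uses a PPO-style clipped surrogate with a fixed clip range. This is a minimal illustration and not the paper's code. The function names and the `eps` stabilizer are assumptions made for the example.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group (no learned critic).

    `eps` is an illustrative stabilizer for groups with zero variance.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped objective for one sample.

    `ratio` is pi_new(y|x) / pi_old(y|x); `clip_eps` is the fixed
    trust-region half-width that an adaptive scheme would vary.
    """
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Four sampled answers to one prompt, scored 1 (correct) or 0 (wrong).
    advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
    print(advantages)
    print(clipped_surrogate(ratio=1.5, advantage=advantages[0]))
```

In this baseline the clip range is a constant; per the summary above, AGPO's change is to set it from group-level statistics instead.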
Key facts
- AGPO is a critic-free refinement of GRPO
- Uses group-level statistics to control update magnitude and exploration
- Adaptive clipping sets trust-region size from reward dispersion, skewness, probe vote entropy, policy entropy, and step-wise KL drift
- Bidirectional adaptive temperature sampling heats or cools decoding around a base temperature
- Tested on nine English and Chinese math/STEM benchmarks
- Qwen2.5-14B with AGPO achieves 67.3% on GSM8K and 40.5% on MATH
- Outperforms PPO/GRPO under same generated-token budget
- Gains transfer to Llama-3-8B
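
The two mechanisms listed above can be pictured with a small sketch. This summary does not give AGPO's formulas, so everything below is hypothetical: the function names, the weights, the bounds, and even the direction of each adjustment are assumptions chosen for illustration. Only two of the listed signals are used (reward dispersion and probe vote entropy); skewness, policy entropy, and KL drift are omitted. Rewards are assumed to lie in [0, 1].

```python
import math
import statistics
from collections import Counter


def normalized_vote_entropy(answers):
    """Shannon entropy of the probe answers, scaled to [0, 1]."""
    n = len(answers)
    if n < 2:
        return 0.0
    counts = Counter(answers)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(n)


def group_state(rewards, answers):
    """Shared statistical state computed once per group of probes."""
    return {
        "dispersion": statistics.pstdev(rewards),
        "vote_entropy": normalized_vote_entropy(answers),
    }


def adaptive_clip(state, base_eps=0.2, min_eps=0.1, max_eps=0.3):
    """Hypothetical rule: shrink the trust region as uncertainty grows."""
    # For rewards in [0, 1] the population std is at most 0.5.
    scaled_dispersion = min(1.0, 2.0 * state["dispersion"])
    uncertainty = 0.5 * scaled_dispersion + 0.5 * state["vote_entropy"]
    eps = base_eps * (1.5 - uncertainty)
    return max(min_eps, min(max_eps, eps))


def adaptive_temperature(state, base_temp=1.0, gain=0.5):
    """Hypothetical bidirectional rule around `base_temp`.

    Probes that agree (low vote entropy) heat decoding to explore more;
    probes that disagree (high vote entropy) cool it.
    """
    return base_temp * (1.0 + gain * (1.0 - 2.0 * state["vote_entropy"]))


if __name__ == "__main__":
    state = group_state([1.0, 0.0, 1.0, 0.0], ["12", "15", "12", "9"])
    print(state, adaptive_clip(state), adaptive_temperature(state))
```

The point of the sketch is the structure rather than the numbers: one statistics pass over the probes feeds both the clip range and the sampling temperature, which is what "shared statistical state" suggests.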
Entities
Institutions
- arXiv