GXPO: Efficient Multi-Step Lookahead for LLM Reasoning RL
Gradient Extrapolation-Based Policy Optimization (GXPO) is a technique that improves the efficiency of reinforcement learning for large language models. Standard GRPO updates the model from the current step's gradient alone; full multi-step lookahead yields better updates but requires one backward pass per lookahead step. GXPO approximates a longer local lookahead with only three backward passes per update, reusing the same batch of rollouts, rewards, advantages, and GRPO loss. It takes two fast optimizer steps, measures how the gradient changes between them, extrapolates a virtual K-step lookahead point, moves the policy partway toward that point, and applies a corrective adjustment (sketched after the key facts below). This cuts compute while retaining most of the benefit of multi-step lookahead, making it well suited to reasoning tasks with verifiable answers.
Key facts
- GXPO stands for Gradient Extrapolation-Based Policy Optimization
- It is a plug-compatible policy-update rule for GRPO-style reasoning RL
- GXPO approximates longer local lookahead using only three backward passes
- It reuses the same batch of rollouts, rewards, advantages, and GRPO loss
- The method takes two fast optimizer steps and measures gradient changes
- It predicts a virtual K-step lookahead point and moves the policy partway
- Standard GRPO updates using only the current step's gradient
- Full multi-step lookahead is too expensive due to many backward passes
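The following Python sketch makes the update rule concrete. It is a minimal illustration under stated assumptions, not the paper's implementation: it assumes a PyTorch policy module, a scalar `grpo_loss(policy, batch)` callable, plain SGD probe steps, and a linear model of how the gradient drifts between steps. The names `lookahead_k`, `extrapolation_frac`, and `correction_lr` are hypothetical, introduced here for illustration.

```python
import torch


def gxpo_update(policy, batch, grpo_loss, lr=1e-5, lookahead_k=8,
                extrapolation_frac=0.5, correction_lr=1e-5):
    """One GXPO-style update: three backward passes on the same batch.

    Hypothetical sketch; `grpo_loss` is assumed to return a scalar loss
    computed from the batch's rollouts, rewards, and advantages.
    """
    params = [p for p in policy.parameters() if p.requires_grad]
    theta0 = [p.detach().clone() for p in params]

    # Backward pass 1: gradient g0 at the current policy theta0.
    policy.zero_grad()
    grpo_loss(policy, batch).backward()
    g0 = [p.grad.detach().clone() for p in params]

    # Fast probe step: theta1 = theta0 - lr * g0.
    with torch.no_grad():
        for p, g in zip(params, g0):
            p.sub_(lr * g)

    # Backward pass 2: gradient g1 at theta1. The per-step gradient
    # shift dg = g1 - g0 is what gets extrapolated.
    policy.zero_grad()
    grpo_loss(policy, batch).backward()
    g1 = [p.grad.detach().clone() for p in params]
    dg = [b - a for a, b in zip(g0, g1)]

    # Virtual K-step lookahead point: treating the gradient at step k
    # as g0 + k*dg and summing k = 0..K-1 gives
    #   theta_K ~= theta0 - lr * (K * g0 + K*(K-1)/2 * dg).
    K = lookahead_k
    with torch.no_grad():
        for p, t0, a, b in zip(params, theta0, g0, dg):
            virtual = t0 - lr * (K * a + 0.5 * K * (K - 1) * b)
            # Move only partway from theta0 toward the virtual point.
            p.copy_(t0 + extrapolation_frac * (virtual - t0))

    # Backward pass 3: corrective adjustment using a true gradient
    # evaluated at the extrapolated point.
    policy.zero_grad()
    grpo_loss(policy, batch).backward()
    with torch.no_grad():
        for p in params:
            p.sub_(correction_lr * p.grad)
```

Under these assumptions the cost structure is clear: a true lookahead of horizon K = 8 would spend eight backward passes on the batch, while the sketch above spends three regardless of K, which is where the claimed savings come from.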