G2D: Combining Online and Offline RL for Efficient Language Model Reasoning
G2D (GRPO to DPO) is a technique for reducing the computational cost of Reinforcement Learning from Verifiable Rewards (RLVR) for language model reasoning. GRPO, a representative RLVR method, must generate fresh rollouts online throughout training, which is costly and hard to scale. Direct Preference Optimization (DPO) is a more stable offline alternative, but it tends to lag behind online methods such as GRPO when its preference data comes from rollouts of a cold supervised fine-tuned (SFT) policy. G2D addresses this with three stages: a brief GRPO warm-up, construction of a static preference dataset, and offline fine-tuning with DPO. Experiments on Qwen2.5-7B and Llama-3.1-8B indicate that offline DPO after a moderate warm-up can match or exceed GRPO's performance at a significantly lower computational cost. For Qwen2.5-7B, G2D with K=150 delivers competitive results.
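The static preference dataset stage can be illustrated with a minimal Python sketch. The summary does not say how G2D selects its pairs, so the rule used here (pair one verified-correct response with one verified-incorrect response per prompt, and skip prompts where all samples agree) is an assumption for illustration only. The names `build_preference_pairs`, `sample_fn`, and `reward_fn` are hypothetical and do not come from the paper.

```python
import random
from typing import Callable, Dict, List


def build_preference_pairs(
    prompts: List[str],
    sample_fn: Callable[[str, int], List[str]],  # hypothetical: policy rollouts
    reward_fn: Callable[[str, str], float],      # hypothetical: verifiable reward in {0, 1}
    n_samples: int = 8,
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Build a static (prompt, chosen, rejected) dataset from policy rollouts.

    Assumed pairing rule, not taken from the paper: one correct and one
    incorrect response are drawn at random for each prompt.
    """
    rng = random.Random(seed)
    pairs: List[Dict[str, str]] = []
    for prompt in prompts:
        responses = sample_fn(prompt, n_samples)
        # Score each rollout once with the verifier.
        scored = [(resp, reward_fn(prompt, resp)) for resp in responses]
        correct = [resp for resp, score in scored if score > 0.5]
        wrong = [resp for resp, score in scored if score <= 0.5]
        # A prompt with only correct or only wrong samples yields no preference signal.
        if not correct or not wrong:
            continue
        pairs.append(
            {
                "prompt": prompt,
                "chosen": rng.choice(correct),
                "rejected": rng.choice(wrong),
            }
        )
    return pairs
```

In a pipeline of this shape, `sample_fn` would wrap the policy obtained after the GRPO warm-up. The dataset is built once and then reused for every offline DPO epoch, which is where the rollout savings would come from.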
Key facts
- G2D is a three-stage pipeline: GRPO warm-up, static preference dataset construction, offline DPO fine-tuning.
- GRPO requires continuous online rollout generation, making it computationally expensive.
- DPO is a stable offline alternative but often underperforms online RL methods like GRPO when trained on rollouts from a cold SFT policy.
- In experiments on Qwen2.5-7B and Llama-3.1-8B, G2D matches or outperforms GRPO at lower compute cost.
- G2D at K=150 on Qwen2.5-7B achieves competitive results.
- The method addresses scalability issues in RLVR for language model reasoning.
- The paper is available on arXiv with ID 2605.21266.
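For reference, the offline DPO stage listed above relies on the standard DPO objective, which for a single preference pair is -log sigmoid(beta * ((log pi(chosen) - log pi_ref(chosen)) - (log pi(rejected) - log pi_ref(rejected)))). The sketch below computes that per-pair loss in plain Python from sequence log-probabilities. The function name `dpo_loss` and the default `beta=0.1` are illustrative assumptions, and the summary does not state which checkpoint G2D uses as the frozen reference policy.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative default, not a value from the paper
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed token log-probabilities of each full response under
    the trainable policy and under a frozen reference policy.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference policy.
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and reference agree on both responses the margin is zero and the loss equals log 2. The loss falls below log 2 as the policy shifts probability toward the chosen response relative to the reference.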
Entities
Institutions
- arXiv