G2D: Combining Online and Offline RL for Efficient Language Model Reasoning

ai-technology · 2026-05-22

G2D (GRPO to DPO) is a new technique that reduces the compute cost of Reinforcement Learning from Verifiable Rewards (RLVR) for language model reasoning. GRPO, a prominent RLVR method, requires continuous online rollout generation, which is expensive and hard to scale. Direct Preference Optimization (DPO) is a more stable offline alternative, but it often lags behind online methods such as GRPO when its training data comes from rollouts of a cold supervised fine-tuned (SFT) policy. G2D addresses this with a three-stage pipeline: a brief GRPO warm-up, construction of a static preference dataset, and offline fine-tuning with DPO. Experiments on Qwen2.5-7B and Llama-3.1-8B indicate that offline DPO after a moderate warm-up can match or exceed GRPO's performance at significantly lower compute cost. On Qwen2.5-7B, for example, G2D at K=150 delivers competitive results.
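The middle stage, building a static preference dataset, can be illustrated with a small sketch. The summary does not say how G2D forms its pairs, so the approach below is an assumption: for each prompt, one rollout that passes the verifier is paired with one that fails, and prompts whose rollouts all pass or all fail are skipped. The names `build_preference_pairs`, `rollouts`, and `is_correct` are hypothetical and not taken from the paper.

```python
import random
from typing import Callable, Dict, List


def build_preference_pairs(
    rollouts: Dict[str, List[str]],
    is_correct: Callable[[str, str], bool],
    seed: int = 0,
) -> List[dict]:
    """Pair one verified-correct and one incorrect rollout per prompt.

    Assumption: this pairing rule is illustrative only; the source does not
    describe the exact construction used by G2D.
    """
    rng = random.Random(seed)
    pairs = []
    for prompt, responses in rollouts.items():
        good = [r for r in responses if is_correct(prompt, r)]
        bad = [r for r in responses if not is_correct(prompt, r)]
        if not good or not bad:
            # No contrast available for this prompt, so it yields no pair.
            continue
        pairs.append(
            {
                "prompt": prompt,
                "chosen": rng.choice(good),
                "rejected": rng.choice(bad),
            }
        )
    return pairs


# Toy data: a verifiable reward here is just an exact-match answer check.
answers = {"2+2=?": "4", "3*3=?": "9", "1+1=?": "2"}
rollouts = {
    "2+2=?": ["4", "5", "4"],
    "3*3=?": ["9", "9"],  # all correct, so this prompt is skipped
    "1+1=?": ["3", "2"],
}
pairs = build_preference_pairs(rollouts, lambda p, r: r.strip() == answers[p])
```

Because the resulting list is fixed once it is built, the later DPO stage can train on it without generating any further rollouts, which is where the compute savings described above would come from.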

Key facts

G2D is a three-stage pipeline: GRPO warm-up, static preference dataset construction, offline DPO fine-tuning.
GRPO requires continuous online rollout generation, making it computationally expensive.
DPO is a stable offline alternative but typically underperforms online RL methods like GRPO when trained on rollouts from a cold SFT policy.
In experiments on Qwen2.5-7B and Llama-3.1-8B, G2D matches or outperforms GRPO at lower compute cost.
G2D at K=150 on Qwen2.5-7B achieves competitive results.
The method addresses scalability issues in RLVR for language model reasoning.
The paper is available on arXiv with ID 2605.21266.
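For readers less familiar with the two objectives that G2D combines, the sketch below shows the group-relative advantage commonly used with GRPO and the standard per-pair DPO loss, in plain Python. It is a minimal illustration under stated assumptions rather than the paper's implementation: `beta=0.1`, the use of the population standard deviation, and the function names are choices made here, and the log-probabilities are assumed to be summed over response tokens.

```python
import math
from statistics import mean, pstdev
from typing import List


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts for the same prompt.

    Assumption: population standard deviation; implementations differ on this.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dpo_loss(
    pi_chosen: float,
    pi_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,  # illustrative value, not taken from the source
) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (policy margin - reference margin))).

    Inputs are log-probabilities of the chosen and rejected responses under
    the trained policy (pi_*) and the frozen reference model (ref_*).
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Binary verifiable rewards for four rollouts of one prompt.
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])

# Policy identical to the reference gives the baseline loss log(2).
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)

# Policy that raises the chosen response and lowers the rejected one.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The GRPO advantage has to be recomputed on fresh rollouts at every step, whereas the DPO loss only needs log-probabilities on a fixed set of pairs. That contrast is the online versus offline trade-off the key facts describe.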
