Standard SFT-then-RL Pipeline Outperforms Mixed-Policy Methods After Bug Fixes
A recent arXiv preprint argues that mixed-policy optimization methods for LLM reasoning, which blend supervised and reinforcement learning signals, have been compared against flawed baselines because of two bugs. The first is a bug in DeepSpeed's CPU-offloaded optimizer that silently drops intermediate micro-batches during gradient accumulation, affecting frameworks such as TRL, OpenRLHF, and Llama-Factory. The second is a loss aggregation error in OpenRLHF that weights per-mini-batch losses incorrectly. Both bugs depress SFT baseline performance, with the optimizer bug accounting for most of the gap. Once the bugs are fixed, the conventional SFT-then-RL pipeline outperforms all mixed-policy methods, by +3.8 points on math benchmarks with Qwen2.5-Math-7B and by +22.2 points with LLaMA, calling the claimed benefits of mixed-policy strategies into question.
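To give a concrete sense of what the first bug means, the sketch below is a minimal, hypothetical reproduction of the failure mode in plain PyTorch; it is not DeepSpeed's optimizer code, and the model, data, and `drop_intermediate` flag are illustrative assumptions rather than details from the preprint.

```python
# Illustrative-only sketch in plain PyTorch (not DeepSpeed's code): it contrasts
# gradient accumulation over all micro-batches with a hypothetical buggy path
# in which only the final micro-batch contributes to the accumulated gradient.
import torch

torch.manual_seed(0)
model = torch.nn.Linear(8, 1)
x, y = torch.randn(16, 8), torch.randn(16, 1)
micro_batches = list(zip(x.chunk(4), y.chunk(4)))  # 4 micro-batches of 4 samples

def accumulated_grad_norm(drop_intermediate: bool) -> float:
    model.zero_grad()
    for i, (xb, yb) in enumerate(micro_batches):
        loss = torch.nn.functional.mse_loss(model(xb), yb) / len(micro_batches)
        if drop_intermediate and i < len(micro_batches) - 1:
            continue  # hypothetical buggy path: intermediate micro-batches vanish
        loss.backward()
    return model.weight.grad.norm().item()

print("all micro-batches kept   :", accumulated_grad_norm(False))
print("intermediate ones dropped:", accumulated_grad_norm(True))
```

The two printed gradient norms differ, which is the kind of silent discrepancy in the accumulated gradient that the preprint attributes to the CPU-offloaded optimizer.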
Key facts
- Mixed-policy optimization methods for LLM reasoning rely on faulty baselines
- Two bugs identified: DeepSpeed optimizer bug and OpenRLHF loss aggregation bug
- DeepSpeed bug silently drops intermediate micro-batches during gradient accumulation
- Bug affects TRL, OpenRLHF, and Llama-Factory frameworks
- OpenRLHF bug incorrectly weights per-mini-batch losses (see the sketch after this list)
- Corrected SFT-then-RL pipeline outperforms mixed-policy methods by +3.8 points on math benchmarks with Qwen2.5-Math-7B
- Corrected pipeline outperforms by +22.2 points with LLaMA
- Findings published on arXiv (2604.23747)
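The second bug concerns how per-micro-batch losses are combined. The toy example below (illustrative numbers only, not OpenRLHF code) shows how an equal-weight average of per-micro-batch mean losses diverges from a token-weighted average whenever micro-batches contain different numbers of target tokens; the specific weighting the preprint identifies as correct should be taken from the paper itself.

```python
# Illustrative-only example: two ways to aggregate per-micro-batch losses.
# The per-token losses and token counts below are invented for demonstration.
per_token_losses = [
    [2.0, 2.0],                      # micro-batch 1: 2 target tokens
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],  # micro-batch 2: 6 target tokens
]

# Equal-weight aggregation: every micro-batch's mean loss counts the same.
equal_weight = sum(sum(mb) / len(mb) for mb in per_token_losses) / len(per_token_losses)

# Token-weighted aggregation: a single mean over all target tokens.
token_weighted = sum(sum(mb) for mb in per_token_losses) / sum(len(mb) for mb in per_token_losses)

print(f"equal-weight mean  : {equal_weight:.3f}")    # 1.500
print(f"token-weighted mean: {token_weighted:.3f}")  # 1.250
```

Whenever the two disagree, the effective loss (and hence the gradient scale) depends on how examples happen to be split into micro-batches, which is the class of weighting error the preprint reports for OpenRLHF.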
Entities
Frameworks and platforms
- DeepSpeed
- TRL
- OpenRLHF
- Llama-Factory
- arXiv