Standard SFT-then-RL Pipeline Outperforms Mixed-Policy Methods After Bug Fixes
A recent arXiv preprint argues that mixed-policy optimization methods for LLM reasoning, which blend supervised and reinforcement learning signals, have been compared against flawed baselines because of two bugs. The first is a bug in DeepSpeed's CPU-offloaded optimizer that silently drops intermediate micro-batches during gradient accumulation, affecting frameworks such as TRL, OpenRLHF, and Llama-Factory. The second is a loss aggregation error in OpenRLHF that weights per-mini-batch losses incorrectly. Both bugs depress SFT baseline performance, with the optimizer bug accounting for most of the gap. Once the bugs are fixed, the conventional SFT-then-RL pipeline outperforms all mixed-policy methods, by +3.8 points on math benchmarks with Qwen2.5-Math-7B and by +22.2 points with LLaMA, calling the claimed benefits of mixed-policy strategies into question.
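To give a concrete sense of what the first bug means, the sketch below is a minimal, hypothetical reproduction of the failure mode in plain PyTorch; it is not DeepSpeed's optimizer code, and the model, data, and `drop_intermediate` flag are illustrative assumptions rather than details from the preprint.

```python
# Illustrative-only sketch in plain PyTorch (not DeepSpeed's code): it contrasts
# gradient accumulation over all micro-batches with a hypothetical buggy path
# in which only the final micro-batch contributes to the accumulated gradient.
import torch

torch.manual_seed(0)
model = torch.nn.Linear(8, 1)
x, y = torch.randn(16, 8), torch.randn(16, 1)
micro_batches = list(zip(x.chunk(4), y.chunk(4)))  # 4 micro-batches of 4 samples

def accumulated_grad_norm(drop_intermediate: bool) -> float:
    model.zero_grad()
    for i, (xb, yb) in enumerate(micro_batches):
        loss = torch.nn.functional.mse_loss(model(xb), yb) / len(micro_batches)
        if drop_intermediate and i < len(micro_batches) - 1:
            continue  # hypothetical buggy path: intermediate micro-batches vanish
        loss.backward()
    return model.weight.grad.norm().item()

print("all micro-batches kept   :", accumulated_grad_norm(False))
print("intermediate ones dropped:", accumulated_grad_norm(True))
```

The two printed gradient norms differ, which is the kind of silent discrepancy in the accumulated gradient that the preprint attributes to the CPU-offloaded optimizer.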
Key facts
- Mixed-policy optimization methods for LLM reasoning rely on faulty baselines
- Two bugs identified: DeepSpeed optimizer bug and OpenRLHF loss aggregation bug
- DeepSpeed bug silently drops intermediate micro-batches during gradient accumulation
- Bug affects TRL, OpenRLHF, and Llama-Factory frameworks
- OpenRLHF bug incorrectly weights per-mini-batch losses (see the sketch after this list)
- Corrected SFT-then-RL pipeline outperforms mixed-policy methods by +3.8 points on math benchmarks with Qwen2.5-Math-7B
- Corrected pipeline outperforms by +22.2 points with LLaMA
- Findings published on arXiv (2604.23747)
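The second bug concerns how per-micro-batch losses are combined. The toy example below (illustrative numbers only, not OpenRLHF code) shows how an equal-weight average of per-micro-batch mean losses diverges from a token-weighted average whenever micro-batches contain different numbers of target tokens; the specific weighting the preprint identifies as correct should be taken from the paper itself.

```python
# Illustrative-only example: two ways to aggregate per-micro-batch losses.
# The per-token losses and token counts below are invented for demonstration.
per_token_losses = [
    [2.0, 2.0],                      # micro-batch 1: 2 target tokens
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],  # micro-batch 2: 6 target tokens
]

# Equal-weight aggregation: every micro-batch's mean loss counts the same.
equal_weight = sum(sum(mb) / len(mb) for mb in per_token_losses) / len(per_token_losses)

# Token-weighted aggregation: a single mean over all target tokens.
token_weighted = sum(sum(mb) for mb in per_token_losses) / sum(len(mb) for mb in per_token_losses)

print(f"equal-weight mean  : {equal_weight:.3f}")    # 1.500
print(f"token-weighted mean: {token_weighted:.3f}")  # 1.250
```

Whenever the two disagree, the effective loss (and hence the gradient scale) depends on how examples happen to be split into micro-batches, which is the class of weighting error the preprint reports for OpenRLHF.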
Entities
Frameworks and platforms
- DeepSpeed
- TRL
- OpenRLHF
- Llama-Factory
- arXiv