ARTFEED — Contemporary Art Intelligence

Standard SFT-then-RL Pipeline Outperforms Mixed-Policy Methods After Bug Fixes

ai-technology · 2026-04-29

A recent arXiv preprint argues that the latest mixed-policy optimization techniques for LLM reasoning, which blend supervised and reinforcement learning signals, were evaluated against flawed baselines because of two bugs. The first is in DeepSpeed's CPU-offloaded optimizer, which silently drops intermediate micro-batches during gradient accumulation; it affects frameworks built on DeepSpeed, including TRL, OpenRLHF, and Llama-Factory. The second is a loss aggregation error in OpenRLHF that incorrectly weights per-mini-batch losses. Both bugs depress SFT baseline performance, with the optimizer bug the primary contributor to the gap. Once they are fixed, the conventional SFT-then-RL pipeline outperforms all mixed-policy methods, by +3.8 points on math benchmarks with Qwen2.5-Math-7B and +22.2 points with LLaMA, calling the claimed benefits of mixed-policy strategies into question.
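To see why dropping micro-batches matters, consider a minimal sketch of gradient accumulation on a toy linear model. The example below is illustrative only (it does not reproduce DeepSpeed's actual code): with equal-sized micro-batches, averaging per-micro-batch gradients must exactly match the full-batch gradient, and a hypothetical bug that keeps only the last micro-batch breaks that equality.

```python
import numpy as np

# Toy linear regression: loss = mean((x @ w - y)**2) over the full batch.
# Gradient accumulation should reproduce the full-batch gradient exactly.
# A hypothetical bug that silently drops intermediate micro-batches
# (keeping only the last) biases the result, analogous in spirit to the
# DeepSpeed issue described above. All names here are illustrative.

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = rng.normal(size=3)

def grad(xb, yb, w):
    # d/dw mean((xb @ w - yb)**2) = (2/n) * xb.T @ (xb @ w - yb)
    return 2.0 / len(yb) * xb.T @ (xb @ w - yb)

micro_batches = [(x[i:i + 2], y[i:i + 2]) for i in range(0, 8, 2)]

# Correct accumulation: average the gradients of all micro-batches.
g_correct = sum(grad(xb, yb, w) for xb, yb in micro_batches) / len(micro_batches)

# Buggy accumulation: intermediate micro-batches are lost; only the
# last one contributes.
g_buggy = grad(*micro_batches[-1], w)

g_full = grad(x, y, w)
print(np.allclose(g_correct, g_full))  # True
print(np.allclose(g_buggy, g_full))    # False
```

Because the buggy update still points in a plausible descent direction on the surviving micro-batch, training does not crash; the model simply learns from a fraction of the data, which is why the bug can go unnoticed while quietly degrading the SFT baseline.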

Key facts

  • Mixed-policy optimization methods for LLM reasoning rely on faulty baselines
  • Two bugs identified: DeepSpeed optimizer bug and OpenRLHF loss aggregation bug
  • DeepSpeed bug silently drops intermediate micro-batches during gradient accumulation
  • Bug affects TRL, OpenRLHF, and Llama-Factory frameworks
  • OpenRLHF bug incorrectly weights per-mini-batch losses
  • Corrected SFT-then-RL pipeline outperforms mixed-policy methods by +3.8 points on math benchmarks with Qwen2.5-Math-7B
  • Corrected pipeline outperforms by +22.2 points with LLaMA
  • Findings published on arXiv (2604.23747)

Entities

Frameworks & platforms

  • DeepSpeed
  • TRL
  • OpenRLHF
  • Llama-Factory
  • arXiv

Sources