ARTFEED — Contemporary Art Intelligence

SFT and RL Cannot Be Decoupled in LLM Post-Training

other · 2026-05-07

A recent theoretical study shows that supervised fine-tuning (SFT) and reinforcement learning (RL) are interdependent during the post-training of large language models. The analysis finds that alternating between SFT and RL, as is typical in contemporary reasoning models, causes mutual degradation: RL increases the SFT loss under both distributional (KL-based) and loss-landscape (PL-based) analyses, while SFT lowers the reward achieved by RL under analogous conditions. The authors derive the optimal RL training duration under the PL condition, balancing reward improvement against SFT degradation, and identify the threshold beyond which the two stages cannot be decoupled. The paper is available on arXiv.
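As a rough formalization of the two objectives involved (the notation below is illustrative and assumed here rather than taken from the paper, reading KL as Kullback-Leibler divergence and PL as the Polyak-Łojasiewicz condition): let π_θ denote the model being post-trained, D the SFT demonstration distribution, π_ref a reference policy, r a reward model, and β a regularization weight. SFT minimizes a cross-entropy objective, while RL maximizes a (typically KL-regularized) expected reward:

\[
\mathcal{L}_{\mathrm{SFT}}(\theta) = \mathbb{E}_{(x,y)\sim D}\bigl[-\log \pi_\theta(y \mid x)\bigr],
\]
\[
\mathcal{J}_{\mathrm{RL}}(\theta) = \mathbb{E}_{x \sim D,\; y \sim \pi_\theta(\cdot \mid x)}\bigl[r(x, y)\bigr] - \beta\,\mathrm{KL}\bigl(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\bigr).
\]

In this notation, the reported findings read: updates that improve \(\mathcal{J}_{\mathrm{RL}}\) raise \(\mathcal{L}_{\mathrm{SFT}}\), updates that lower \(\mathcal{L}_{\mathrm{SFT}}\) reduce \(\mathcal{J}_{\mathrm{RL}}\), and the optimal RL duration is the stopping point at which further reward gains no longer offset the induced rise in SFT loss.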

Key facts

  • Supervised fine-tuning (SFT) and reinforcement learning (RL) cannot be decoupled in LLM post-training.
  • RL increases SFT loss under both KL-based and PL-based analyses.
  • SFT lowers the reward achieved by RL under analogous conditions.
  • The optimal RL duration balances reward improvement against SFT degradation under the PL condition.
  • The non-decoupling threshold is identified.
  • Modern reasoning models widely alternate SFT and RL training (a toy sketch of this alternation follows this list).
  • The paper is published on arXiv with ID 2601.07389.
  • The study provides a theoretical proof of the non-decoupling result.
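
For intuition only, here is a minimal, self-contained toy in Python (not the paper's setting; the distributions, rewards, and step sizes are made up for illustration). A single-state softmax policy is first fit to a demonstration distribution (the SFT phase), then updated with exact policy-gradient steps on a reward that favors other actions (the RL phase), then fit to the demonstrations again. The qualitative pattern matches the facts above: the RL phase raises the SFT cross-entropy while increasing expected reward, and the return to SFT lowers the reward again.

import numpy as np

n_actions = 4
theta = np.zeros(n_actions)               # logits of a single-state softmax policy

demo = np.array([0.7, 0.2, 0.1, 0.0])     # demonstration distribution (SFT target)
reward = np.array([0.0, 0.0, 1.0, 0.5])   # reward model prefers actions 2 and 3

def policy(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def sft_loss(theta):
    # Cross-entropy of the demonstrations under the current policy.
    return -np.sum(demo * np.log(policy(theta) + 1e-12))

def sft_step(theta, lr=0.5):
    # Gradient of the cross-entropy w.r.t. the logits is (policy - demo).
    return theta - lr * (policy(theta) - demo)

def rl_step(theta, lr=0.5):
    # Exact policy gradient of expected reward for a softmax policy:
    # d/dtheta E_pi[r] = p * (r - E_pi[r]).
    p = policy(theta)
    baseline = p @ reward
    return theta + lr * p * (reward - baseline)

def report(tag, theta):
    print(f"{tag:10s} sft_loss={sft_loss(theta):.3f}  reward={policy(theta) @ reward:.3f}")

for _ in range(50):                        # phase 1: SFT only
    theta = sft_step(theta)
report("after SFT", theta)

for _ in range(50):                        # phase 2: RL only (reward up, SFT loss up)
    theta = rl_step(theta)
report("after RL", theta)

for _ in range(50):                        # phase 3: SFT again (SFT loss down, reward down)
    theta = sft_step(theta)
report("after SFT2", theta)

The toy uses a closed-form gradient because there is only one state, which keeps the illustration deterministic; it says nothing about the KL- or PL-based conditions under which the paper proves the effect, only that the two objectives can pull the same parameters in opposite directions.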

Entities

Institutions

  • arXiv

Sources