State Distribution, Not Loss Function, Drives LLM Post-Training

ai-technology · 2026-05-23

A recent arXiv paper (2605.22731) argues that the distribution of states on which supervision is applied matters more for post-training large language models than the loss function itself. The researchers formalize post-training as state-distribution shaping and run controlled experiments with Qwen3-0.6B-Base on GSM8K, measuring retention on TruthfulQA and MMLU. Their results indicate that a mild SFT run improves GSM8K performance with little forgetting, whereas a stress SFT run causes substantial retention loss. Furthermore, on-policy distillation from a degraded SFT teacher surpasses the teacher itself. The paper highlights three observations: the contrast between mild and stress SFT, the advantage of on-policy distillation, and the role of the state distribution in shaping model behavior.

Key facts

Paper arXiv:2605.22731 analyzes LLM post-training methods
Focuses on state distribution rather than loss functions
Uses Qwen3-0.6B-Base model
Evaluated on GSM8K, TruthfulQA, and MMLU
Mild SFT improves GSM8K with little forgetting
Stress SFT causes substantial retention loss
On-policy distillation from degraded SFT teacher surpasses teacher
Formalizes post-training as state-distribution shaping
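
To make the distinction between a loss function and a state distribution concrete, the toy sketch below averages one fixed per-state loss over two different sets of states: prefixes taken from a fixed dataset (as in SFT) and prefixes generated by the student itself (as in on-policy distillation). It is an illustration only and not the paper's code. The two-token vocabulary, the `teacher` and `student` policies, and every function name are hypothetical.

```python
import math
import random

VOCAB = ["a", "b"]  # hypothetical two-token vocabulary


def teacher(prefix):
    # Hypothetical teacher: the same next-token distribution at every state.
    return {"a": 0.8, "b": 0.2}


def student(prefix):
    # Hypothetical student: matches the teacher on all-"a" prefixes
    # (the kind of state the dataset contains) and diverges elsewhere.
    if all(tok == "a" for tok in prefix):
        return teacher(prefix)
    return {"a": 0.1, "b": 0.9}


def per_state_loss(state):
    # One fixed loss for both settings: KL(teacher || student) at a state.
    p, q = teacher(state), student(state)
    return sum(p[t] * math.log(p[t] / q[t]) for t in VOCAB)


def sft_states(dataset):
    # Off-policy: states are prefixes of fixed reference sequences.
    return [tuple(seq[:i]) for seq in dataset for i in range(len(seq))]


def on_policy_states(n_rollouts, length, rng):
    # On-policy: states are prefixes the student generates itself.
    states = []
    for _ in range(n_rollouts):
        prefix = []
        for _ in range(length):
            states.append(tuple(prefix))
            probs = student(prefix)
            weights = [probs[t] for t in VOCAB]
            prefix.append(rng.choices(VOCAB, weights=weights)[0])
    return states


def average_loss(states):
    return sum(per_state_loss(s) for s in states) / len(states)


dataset = [["a", "a", "a"]] * 4
rng = random.Random(0)
off_policy = sft_states(dataset)
on_policy = on_policy_states(n_rollouts=4, length=3, rng=rng)

print("loss on dataset states:  ", average_loss(off_policy))
print("loss on student's states:", average_loss(on_policy))
```

On the dataset states the student already agrees with the teacher, so the averaged loss is zero and gives no training signal. Any disagreement only shows up on states the student reaches by itself, which is the sense in which the choice of states, rather than the choice of loss, determines what gets corrected in this toy setting.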
