ARTFEED — Contemporary Art Intelligence

State Distribution, Not Loss Function, Drives LLM Post-Training

ai-technology · 2026-05-23

A recent study published on arXiv (2605.22731) argues that the distribution of states on which supervision is applied matters more for post-training large language models than the choice of loss function. The researchers formalize post-training as state-distribution shaping and run controlled experiments with Qwen3-0.6B-Base on GSM8K, measuring retention on TruthfulQA and MMLU. They report that a mild SFT run improves GSM8K performance with little forgetting, whereas a stress SFT run causes substantial retention loss. In addition, on-policy distillation from a degraded SFT teacher outperforms the teacher itself. Taken together, the study highlights three observations: the contrast between mild and stress SFT, the advantage of on-policy distillation, and the role of state distribution in shaping model behavior.
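To make the central idea concrete, here is a minimal toy sketch, not taken from the paper. It assumes a three-state Markov chain in which a "policy" is just a table of next-state probabilities. All names (`visitation`, `expected_loss`, `teacher`, `student`) and all numbers are hypothetical and chosen for illustration. The per-state loss is held fixed (KL divergence from teacher to student); only the distribution of states over which it is averaged changes.

```python
import math

def visitation(policy, start, horizon):
    """Exact average state-visitation distribution of `policy` over `horizon` steps."""
    n = len(policy)
    current = [0.0] * n
    current[start] = 1.0
    total = [0.0] * n
    for _ in range(horizon):
        for s in range(n):
            total[s] += current[s] / horizon
        # Propagate probability mass one step forward under the policy.
        nxt = [0.0] * n
        for s in range(n):
            for a, p in enumerate(policy[s]):
                nxt[a] += current[s] * p
        current = nxt
    return total

def kl(p, q):
    """KL(p || q) for two discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def expected_loss(state_dist, teacher, student):
    """Same per-state loss everywhere; only the weighting over states differs."""
    return sum(w * kl(teacher[s], student[s]) for s, w in enumerate(state_dist))

# Hypothetical policies: row s is the next-state distribution from state s.
teacher = [
    [0.05, 0.90, 0.05],
    [0.05, 0.90, 0.05],
    [0.05, 0.90, 0.05],
]
student = [
    [0.05, 0.30, 0.65],  # drifts toward state 2 from the start state
    [0.05, 0.90, 0.05],  # matches the teacher in state 1
    [0.05, 0.05, 0.90],  # badly wrong in state 2, which the teacher rarely visits
]

d_teacher = visitation(teacher, start=0, horizon=4)  # states seen in teacher-generated data
d_student = visitation(student, start=0, horizon=4)  # states the student reaches on its own

off_policy_loss = expected_loss(d_teacher, teacher, student)
on_policy_loss = expected_loss(d_student, teacher, student)

print(f"loss weighted by teacher states: {off_policy_loss:.3f}")
print(f"loss weighted by student states: {on_policy_loss:.3f}")
```

In this toy setup the student's largest error sits in a state the teacher seldom reaches, so averaging the identical loss over the student's own states weights that error far more heavily. This is only meant to illustrate why the choice of state distribution can matter independently of the loss; it does not reproduce the paper's experiments.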

Key facts

  • Paper arXiv:2605.22731 analyzes LLM post-training methods
  • Focuses on state distribution rather than loss functions
  • Uses Qwen3-0.6B-Base model
  • Evaluated on GSM8K, TruthfulQA, and MMLU
  • Mild SFT improves GSM8K with little forgetting
  • Stress SFT causes substantial retention loss
  • On-policy distillation from degraded SFT teacher surpasses teacher
  • Formalizes post-training as state-distribution shaping

Entities

Institutions

  • arXiv
