SePT: Self-Training Boosts LLM Reasoning Without External Rewards

ai-technology · 2026-05-18

Self-evolving Post-Training (SePT) lets a language model improve its reasoning using only its own sampled responses, with no external reward signal. The method alternates between self-generation and training on the generated responses, and it refreshes the data online: each new batch is sampled from the latest version of the model. Across six math reasoning benchmarks and multiple models, SePT improves on a strong no-training baseline that is evaluated at its best swept decoding temperature. Ablation studies indicate that both the online data refresh and the temperature dynamics matter.
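The alternation between self-generation and training can be sketched as a toy loop. This is a minimal illustration, not the paper's implementation: the "model" is only a vector of logits over a few candidate answers, training is a single log-likelihood gradient step on the model's own samples, and every name (`sept_loop`, `sample_responses`, `sft_step`) and hyperparameter value is an assumption introduced here. The one thing taken from the description is the structure, where each round samples a fresh batch from the current model and then trains on it.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities at the given sampling temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_responses(logits, n, temperature, rng):
    """Self-generation: draw n responses (answer indices) from the model."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=n)


def sft_step(logits, batch, lr):
    """One gradient-ascent step on the mean log-likelihood of the batch.

    For a softmax over logits, d/d logit_i of mean log p(r) is
    (empirical frequency of i in the batch) - p_i.
    """
    probs = softmax(logits)
    counts = [0] * len(logits)
    for r in batch:
        counts[r] += 1
    n = len(batch)
    return [l + lr * (c / n - p) for l, c, p in zip(logits, counts, probs)]


def sept_loop(logits, rounds, batch_size, temperature, lr, seed=0):
    """Alternate self-generation and training with online data refresh."""
    rng = random.Random(seed)
    history = [softmax(logits)]
    for _ in range(rounds):
        # Online refresh: the batch always comes from the latest logits.
        batch = sample_responses(logits, batch_size, temperature, rng)
        logits = sft_step(logits, batch, lr)
        history.append(softmax(logits))
    return logits, history


if __name__ == "__main__":
    # Hypothetical settings chosen only to make the sketch runnable.
    final_logits, history = sept_loop(
        [1.0, 0.0, 0.0], rounds=20, batch_size=32, temperature=0.7, lr=0.5
    )
    print("initial distribution:", history[0])
    print("final distribution:  ", history[-1])
```

Because the batch is re-sampled from the updated logits in every round, the training data tracks the model as it changes. A static variant would sample once before the loop and reuse that batch, which is the kind of contrast an online-refresh ablation would examine.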

Key facts

SePT stands for Self-evolving Post-Training.
The method uses only the model's own sampled responses for training.
It alternates between self-generation and training on self-generated responses.
An online data refresh mechanism is used, where each new batch comes from the latest model.
SePT was tested across six math reasoning benchmarks.
It improves upon a no-training baseline evaluated at its best swept decoding temperature.
Ablations show the importance of online data refresh and temperature dynamics.
The paper is available on arXiv with ID 2510.18814.
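
A baseline "evaluated at its best swept decoding temperature" can be read as: score the untrained model at several temperatures and report the best one. The sketch below shows that selection step under stated assumptions. `evaluate` is a hypothetical callable that returns a benchmark score for a given temperature, and `toy_accuracy` is an invented curve used only to make the example runnable; it is not a result from the paper.

```python
def best_temperature_baseline(evaluate, temperatures):
    """Score the untrained model at each temperature and keep the best."""
    scores = {t: evaluate(t) for t in temperatures}
    best_t = max(scores, key=scores.get)
    return best_t, scores[best_t], scores


def toy_accuracy(temperature):
    """Invented stand-in for a real benchmark run (peaks at 0.6)."""
    return 0.6 - (temperature - 0.6) ** 2


if __name__ == "__main__":
    grid = [0.2, 0.4, 0.6, 0.8, 1.0]  # hypothetical sweep grid
    best_t, best_score, _ = best_temperature_baseline(toy_accuracy, grid)
    print("best temperature:", best_t, "score:", best_score)
```

Comparing against the best point of the sweep, rather than a single default temperature, keeps a gain from being credited to training when a better decoding temperature alone would have produced it.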
