Self-Training Works Only When Synthetic Data Matches the Model

ai-technology · 2026-06-01

A recent arXiv paper (2605.31126) reports that language models can improve by training on text they generate themselves, without prompts, teachers, or reward signals, provided the synthetic corpus is compatible with the student model. The authors propose the 'latent capability resurfacing hypothesis': weak self-training strengthens capabilities the model already has when the generated text is well matched to the student. Compatibility in this sense is relational, a property of the pairing between corpus and student rather than of the data alone. The study focuses on prompt-free unconditional self-training, in which base models are trained further on text generated from the BOS token alone. Key findings: the usefulness of synthetic data is relational, same-lineage transfer outperforms transfer from stronger but differently trained sources, and cross-family transfer falls short.
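The prompt-free setting described above can be illustrated with a toy sketch. This is not the paper's code: a bigram counter stands in for a language model, and every name here (`fit_bigram`, `sample_from_bos`, `self_train`, the seed sentences) is hypothetical. It only shows the shape of the loop, which is to sample from the BOS token with no prompt and then train on the samples.

```python
import random
from collections import defaultdict

BOS, EOS = "<bos>", "<eos>"  # assumed sentinel tokens for this toy model

def fit_bigram(corpus):
    """Count bigram transitions over tokenized sentences wrapped in BOS/EOS."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        tokens = [BOS] + sentence + [EOS]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def sample_from_bos(counts, rng, max_len=20):
    """Generate one sequence starting from BOS only (no prompt)."""
    token, out = BOS, []
    for _ in range(max_len):
        choices = counts[token]
        nxt = rng.choices(list(choices), weights=list(choices.values()))[0]
        if nxt == EOS:
            break
        out.append(nxt)
        token = nxt
    return out

def self_train(counts, rng, n_samples=200):
    """Sample a synthetic corpus from the model, then refit on it."""
    synthetic = [sample_from_bos(counts, rng) for _ in range(n_samples)]
    return fit_bigram(synthetic), synthetic

# Toy "base model" fitted on two seed sentences (illustrative data only).
seed = [
    "the model samples text".split(),
    "the model trains on text".split(),
]
base = fit_bigram(seed)
student, synthetic = self_train(base, random.Random(0))
```

In this toy, the refit model can only contain transitions the base model already had, so nothing new enters through the samples. That is loosely in the spirit of the article's point that self-training resurfaces existing capability, though it says nothing about the paper's actual results.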

Key facts

arXiv paper 2605.31126
Title: Not All Synthetic Data Is Yours to Learn From
Language models can improve from self-generated text without prompts or supervision
Compatibility between synthetic corpus and student model is required
Latent capability resurfacing hypothesis proposed
Prompt-free unconditional self-training setting studied
Self-generated data is the most effective source
Same-lineage transfer outperforms cross-family transfer
