Self-Training Works Only When Synthetic Data Matches the Model
A recent study posted on arXiv (2605.31126) reports that language models can improve by training on text they generate themselves, with no prompts, teacher models, or reward signals, provided the synthetic corpus is compatible with the student model. The authors propose the "latent capability resurfacing hypothesis": weak self-training amplifies capabilities the model already has, and it does so only when the generated text aligns well with the student. Compatibility is relational: it depends on the pairing of corpus and student rather than on any intrinsic quality of the data. The research focuses on prompt-free unconditional self-training, in which base models are fine-tuned on text generated from the BOS token alone. Key findings are that the usefulness of synthetic data is relational, that data from same-lineage models transfers better than data from stronger but differently trained sources, and that cross-family transfer falls short. The full paper can be accessed at arXiv:2605.31126.
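To make the prompt-free setting concrete, here is a minimal sketch of the data flow: sample text starting from the BOS token only, then train the same model on those samples. It is an illustration under stated assumptions, not the authors' implementation. A toy add-one-smoothed bigram model stands in for a base language model, and the names `BigramLM`, `self_train_round`, and `avg_log_likelihood` are hypothetical. The likelihood score is only an assumed proxy for compatibility; the summary does not say how the paper measures it.

```python
import math
import random
from collections import defaultdict

BOS, EOS = "<bos>", "<eos>"


class BigramLM:
    """Toy stand-in for a base language model (add-one smoothed bigram counts)."""

    def __init__(self, vocab):
        self.vocab = list(vocab) + [EOS]
        self.counts = defaultdict(lambda: defaultdict(float))

    def train(self, corpus):
        # Each sequence is a list of tokens; BOS and EOS are added here.
        for seq in corpus:
            tokens = [BOS] + list(seq) + [EOS]
            for prev, nxt in zip(tokens, tokens[1:]):
                self.counts[prev][nxt] += 1.0

    def prob(self, prev, nxt):
        row = self.counts[prev]
        total = sum(row.values()) + len(self.vocab)
        return (row.get(nxt, 0.0) + 1.0) / total

    def sample(self, rng, max_len=20):
        # Unconditional generation: the only context given is the BOS token.
        prev, out = BOS, []
        for _ in range(max_len):
            weights = [self.prob(prev, tok) for tok in self.vocab]
            nxt = rng.choices(self.vocab, weights=weights, k=1)[0]
            if nxt == EOS:
                break
            out.append(nxt)
            prev = nxt
        return out

    def avg_log_likelihood(self, corpus):
        # Assumed compatibility proxy: mean per-token log-probability
        # of a corpus under this particular student.
        total, n = 0.0, 0
        for seq in corpus:
            tokens = [BOS] + list(seq) + [EOS]
            for prev, nxt in zip(tokens, tokens[1:]):
                total += math.log(self.prob(prev, nxt))
                n += 1
        return total / n


def self_train_round(model, rng, num_samples=200):
    """One round of prompt-free self-training: sample from BOS, then train on it."""
    synthetic = [model.sample(rng) for _ in range(num_samples)]
    model.train(synthetic)
    return synthetic


if __name__ == "__main__":
    rng = random.Random(0)
    student = BigramLM(vocab="abc")
    student.train([list("abcabc"), list("abab")])  # stands in for pretraining
    synthetic = self_train_round(student, rng)
    print(len(synthetic), "synthetic sequences generated from BOS alone")
```

Scoring one fixed corpus under two differently trained students gives two different values, which is the sense in which compatibility belongs to the corpus-student pair rather than to the data alone. A real experiment would replace the bigram model with an actual base model and a proper fine-tuning step.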
Key facts
- arXiv paper 2605.31126
- Title: Not All Synthetic Data Is Yours to Learn From
- Language models can improve from self-generated text without prompts or supervision
- Compatibility between synthetic corpus and student model is required
- Latent capability resurfacing hypothesis proposed
- Prompt-free unconditional self-training setting studied
- Self-generated data is the most effective source
- Same-lineage transfer outperforms cross-family transfer
Entities
Institutions
- arXiv