Self-Generated Data Boosts RL in Language Models
A new paper on arXiv (2605.08472) proposes using diverse self-generated data during mid-training to improve reinforcement learning (RL) in large language models (LLMs). The method, guided by George Pólya's problem-solving framework, generates multiple correct solution variants for each training question before fine-tuning. The authors provide a theoretical analysis showing how policy-gradient updates incentivize combining multiple reasoning approaches, and empirical results demonstrate that this bootstrapped data-generation step makes subsequent RL more effective by exposing the model to a wider range of reasoning strategies during training.
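As a rough illustration of the data-generation step, here is a minimal Python sketch. The `sample_solution` callable and the "strategy" label are hypothetical stand-ins for an LLM sampling call and a Pólya-style approach tag; the paper's actual pipeline and prompting details may differ.

```python
from typing import Callable, Dict, List

def collect_correct_variants(
    question: str,
    reference_answer: str,
    sample_solution: Callable[[str], Dict[str, str]],
    n_samples: int = 8,
) -> List[str]:
    """Sample candidate solutions and keep distinct correct variants.

    `sample_solution` is assumed (hypothetically) to return a dict with
    "final_answer", "strategy" (a Polya-style label such as "work
    backwards" or "solve a simpler problem"), and "solution_text".
    """
    variants: List[str] = []
    seen_strategies = set()
    for _ in range(n_samples):
        candidate = sample_solution(question)
        # Keep only correct answers, at most one per reasoning
        # strategy, so the mid-training set covers diverse approaches.
        if candidate["final_answer"] != reference_answer:
            continue
        if candidate["strategy"] in seen_strategies:
            continue
        seen_strategies.add(candidate["strategy"])
        variants.append(candidate["solution_text"])
    return variants
```

The deduplication by strategy label is the key design choice in this sketch: keeping several distinct correct solutions per question, rather than many copies of one, is what exposes the model to multiple reasoning approaches before RL begins.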
Key facts
- Paper published on arXiv with ID 2605.08472
- Focuses on improving RL in LLMs using self-generated data
- Uses George Pólya's problem-solving approaches
- Generates multiple variants of correct answers per question
- Mid-training step occurs before RL training
- Theoretical analysis of policy-gradient updates (see the sketch after this list)
- Empirical results show improved RL effectiveness
- Addresses limitation of limited reasoning approaches in training data
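For context on the policy-gradient analysis mentioned above, a standard REINFORCE-style estimator takes the form below; this is textbook notation, not necessarily the paper's exact formulation.

```latex
% Standard policy gradient: questions x from dataset D, solutions y
% sampled from the policy \pi_\theta, reward R (e.g., answer correctness).
\nabla_\theta J(\theta)
  = \mathbb{E}_{x \sim \mathcal{D},\ y \sim \pi_\theta(\cdot \mid x)}
    \bigl[ R(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \bigr]
```

Under this estimator, every correct sampled solution receives a positive update, so a policy that already places probability mass on several distinct reasoning strategies has all of them reinforced rather than collapsing onto one. That is one intuition for why seeding mid-training with diverse correct variants would help subsequent RL.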
Entities
Institutions
- arXiv