Self-Generated Data Boosts RL in Language Models
A new paper on arXiv (2605.08472) proposes using diverse self-generated data during mid-training to improve reinforcement learning (RL) in large language models (LLMs). The method, guided by George Pólya's problem-solving framework, generates multiple correct solution variants for each training question before fine-tuning. The authors provide a theoretical analysis showing how policy-gradient updates incentivize combining multiple reasoning approaches, and empirical results demonstrate that this bootstrapped data-generation step makes subsequent RL more effective by exposing the model to a wider range of reasoning strategies during training.
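As a rough illustration of the data-generation step, here is a minimal Python sketch. The `sample_solution` callable and the "strategy" label are hypothetical stand-ins for an LLM sampling call and a Pólya-style approach tag; the paper's actual pipeline and prompting details may differ.

```python
from typing import Callable, Dict, List

def collect_correct_variants(
    question: str,
    reference_answer: str,
    sample_solution: Callable[[str], Dict[str, str]],
    n_samples: int = 8,
) -> List[str]:
    """Sample candidate solutions and keep distinct correct variants.

    `sample_solution` is assumed (hypothetically) to return a dict with
    "final_answer", "strategy" (a Polya-style label such as "work
    backwards" or "solve a simpler problem"), and "solution_text".
    """
    variants: List[str] = []
    seen_strategies = set()
    for _ in range(n_samples):
        candidate = sample_solution(question)
        # Keep only correct answers, at most one per reasoning
        # strategy, so the mid-training set covers diverse approaches.
        if candidate["final_answer"] != reference_answer:
            continue
        if candidate["strategy"] in seen_strategies:
            continue
        seen_strategies.add(candidate["strategy"])
        variants.append(candidate["solution_text"])
    return variants
```

The deduplication by strategy label is the key design choice in this sketch: keeping several distinct correct solutions per question, rather than many copies of one, is what exposes the model to multiple reasoning approaches before RL begins.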
Key facts
- Paper published on arXiv with ID 2605.08472
- Focuses on improving RL in LLMs using self-generated data
- Uses George Pólya's problem-solving approaches
- Generates multiple variants of correct answers per question
- Mid-training step occurs before RL training
- Theoretical analysis of policy-gradient updates (see the sketch after this list)
- Empirical results show improved RL effectiveness
- Addresses limitation of limited reasoning approaches in training data
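For context on the policy-gradient analysis mentioned above, a standard REINFORCE-style estimator takes the form below; this is textbook notation, not necessarily the paper's exact formulation.

```latex
% Standard policy gradient: questions x from dataset D, solutions y
% sampled from the policy \pi_\theta, reward R (e.g., answer correctness).
\nabla_\theta J(\theta)
  = \mathbb{E}_{x \sim \mathcal{D},\ y \sim \pi_\theta(\cdot \mid x)}
    \bigl[ R(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \bigr]
```

Under this estimator, every correct sampled solution receives a positive update, so a policy that already places probability mass on several distinct reasoning strategies has all of them reinforced rather than collapsing onto one. That is one intuition for why seeding mid-training with diverse correct variants would help subsequent RL.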
Entities
Institutions
- arXiv