Introspective Training Boosts LLM Scaling Across All Stages
Introspective Training (IXT) is a new method that aims to improve scaling efficiency at every stage of large language model training, from pre-training through post-training. Inspired by offline reward-conditioned reinforcement learning, IXT uses a thinking reward model to annotate training data with natural-language critiques. Each training example is then prefix-conditioned on its generated critique, so the model sees a quality signal alongside the data instead of treating all tokens equally, which enables quality-aware training from the earliest stages. In experiments on dense transformer-based LLMs with 7.5B to 12B parameters, trained from scratch on up to 18 trillion tokens, the authors report that IXT improves scaling across all training stages. The paper is available on arXiv under ID 2605.20285.
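To make the prefix-conditioning idea concrete, here is a minimal Python sketch of how a critique might be attached to a training example. It is an illustration only, not the paper's implementation: the `<critique>` tag format, the `AnnotatedExample` class, the `prefix_condition` function, and the `toy_critic` stand-in for the thinking reward model are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedExample:
    text: str      # raw training document
    critique: str  # natural-language feedback about the document


def toy_critic(text: str) -> str:
    """Hypothetical stand-in for a thinking reward model.

    A real critic would be a model that reasons about the document;
    this placeholder only checks length so the sketch stays runnable.
    """
    if len(text.split()) < 4:
        return "Very short; carries little information."
    return "Coherent and informative."


def prefix_condition(example: AnnotatedExample,
                     open_tag: str = "<critique>",
                     close_tag: str = "</critique>") -> str:
    """Prepend the critique so the model reads it before the document.

    The tag format is an assumption made for illustration.
    """
    return f"{open_tag}{example.critique.strip()}{close_tag}\n{example.text}"


def build_training_strings(corpus):
    """Annotate every document with a critique, then prefix-condition it."""
    annotated = [AnnotatedExample(doc, toy_critic(doc)) for doc in corpus]
    return [prefix_condition(ex) for ex in annotated]


if __name__ == "__main__":
    docs = [
        "click here",
        "The derivative of x squared with respect to x is 2x.",
    ]
    for line in build_training_strings(docs):
        print(line)
        print("---")
```

Because the critique comes first in the sequence, a causal language model trained on these strings predicts the document tokens conditioned on the quality feedback. In reward-conditioned setups generally, this is what lets generation later be steered by supplying a favorable prefix; whether IXT uses the prefix this way at inference time is not stated in this summary.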
Key facts
- Introspective Training (IXT) is proposed for efficient scaling across LLM training stages.
- IXT is inspired by offline reward-conditioned reinforcement learning.
- It uses a thinking reward model to annotate data with natural language critique feedback.
- Data is prefix-conditioned with generated feedback for quality-aware training.
- Experiments were conducted on dense transformer-based LLMs with 7.5B to 12B parameters.
- Models were trained from scratch on up to 18 trillion tokens.
- IXT improves scaling across all stages of training.
- Paper available on arXiv with ID 2605.20285.
Entities
Institutions
- arXiv