High-Quality Data Repetition Boosts German Language Model Training
A new study on arXiv (2604.28075) investigates the trade-off between data diversity and quality in German language modeling. The researchers applied hierarchical quality filters to 500 million web documents and compared multi-epoch training on high-quality subsets against single-pass training on larger, less filtered corpora. Across multiple model scales and token budgets, repeating high-quality data consistently outperformed single-pass training on diverse data, and the performance gap persisted even after seven epochs. The findings challenge the assumption that diversity is always beneficial, particularly for non-English languages such as German, French, or Japanese, where aggressive filtering shrinks an already limited data pool and thus appears to force a choice between quality and coverage. The study suggests that prioritizing quality over diversity through repetition can yield better sample efficiency.
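The core trade-off can be made concrete with simple token-budget arithmetic: under a fixed training budget, a smaller filtered subset must be repeated, while a larger corpus may never be fully traversed. The sketch below uses hypothetical corpus sizes for illustration; the specific numbers are not taken from the paper.

```python
def epochs_implied(token_budget: int, corpus_tokens: int) -> float:
    """Number of passes over a corpus that a fixed token budget implies."""
    return token_budget / corpus_tokens

# Hypothetical sizes, not figures from the study:
budget = 70_000_000_000            # 70B training tokens
filtered_subset = 10_000_000_000   # aggressively filtered high-quality subset
diverse_corpus = 100_000_000_000   # lightly filtered, more diverse corpus

# The filtered subset is seen 7 times; the diverse corpus less than once.
print(epochs_implied(budget, filtered_subset))  # 7.0
print(epochs_implied(budget, diverse_corpus))   # 0.7
```

The study's finding is that, in this kind of setup, the left-hand regime (many epochs over high-quality data) can outperform the right-hand one (a partial single pass over diverse data) at equal token cost.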
Key facts
- Study investigates data filtering for German language modeling
- Hierarchical quality filters applied to 500 million web documents
- Compared multi-epoch training on filtered subsets vs single-pass on diverse corpus
- Multiple model scales and token budgets tested
- Repeating high-quality data outperformed single-pass training
- Performance gap persisted even after seven epochs
- Challenges assumption that diversity is always beneficial for non-English languages
- Focus on German, but implications for French, Japanese, and other languages