ARTFEED — Contemporary Art Intelligence

High-Quality Data Repetition Boosts German Language Model Training

publication · 2026-05-01

A new study on arXiv (2604.28075) investigates the trade-off between data diversity and data quality for German language modeling. The researchers applied hierarchical quality filters to 500 million web documents, then compared multi-epoch training on high-quality filtered subsets against single-pass training on larger, less aggressively filtered corpora. Across multiple model scales and token budgets, repeating the high-quality data consistently outperformed single-pass training on the more diverse data, and the performance gap persisted even after seven epochs of repetition. The findings challenge the assumption that data diversity is always beneficial, particularly for non-English languages such as German, French, or Japanese, where aggressive filtering sharply shrinks the available corpus and forces a choice between repeating a small high-quality subset and making a single pass over a larger but noisier one. The study suggests that prioritizing quality over diversity through repetition can yield better sample efficiency.
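The core comparison can be made concrete with a small budgeting sketch: under a fixed token budget, a small filtered corpus is traversed for multiple epochs, while a larger diverse corpus may not even be covered in one pass. All numbers below are illustrative, not figures from the paper.

```python
# Hypothetical sketch of the experimental setup: spend one fixed token
# budget either as repeated epochs over a small high-quality corpus or
# as a single (possibly partial) pass over a larger, noisier one.

def plan_training(token_budget: int, corpus_tokens: int) -> dict:
    """Return how many full passes (epochs) the budget buys over the
    corpus, plus the fraction of a final partial epoch."""
    full_epochs, remainder = divmod(token_budget, corpus_tokens)
    return {
        "full_epochs": full_epochs,
        "partial_epoch": remainder / corpus_tokens,
    }

# Example: a 70B-token training budget (illustrative).
budget = 70_000_000_000
filtered = plan_training(budget, corpus_tokens=10_000_000_000)   # small, high quality
diverse = plan_training(budget, corpus_tokens=100_000_000_000)   # large, less filtered

print(filtered)  # {'full_epochs': 7, 'partial_epoch': 0.0} -> seven repeats
print(diverse)   # {'full_epochs': 0, 'partial_epoch': 0.7} -> one partial pass
```

Under these assumed sizes, the filtered regime sees every high-quality document seven times, matching the repetition depth at which the study still observed a performance gap.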

Key facts

  • Study investigates data filtering for German language modeling
  • Hierarchical quality filters applied to 500 million web documents
  • Compared multi-epoch training on filtered subsets vs single-pass on diverse corpus
  • Multiple model scales and token budgets tested
  • Repeating high-quality data outperformed single-pass training
  • Performance gap persisted after 7 epochs
  • Challenges assumption that diversity is always beneficial for non-English languages
  • Focus on German, but implications for French, Japanese, and other languages
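The hierarchical filtering mentioned above can be pictured as a cascade of cheap checks applied in sequence, where a document must pass every stage to survive. The stage names, checks, and thresholds below are invented for illustration; the paper's actual filters are not specified here.

```python
# Illustrative sketch of a hierarchical quality-filter cascade.
# Each stage is a cheap predicate; a document is kept only if it
# passes all stages. Stages and thresholds are hypothetical.

def language_ok(doc: str) -> bool:
    # Placeholder for a real language-ID model ("is this German?").
    return any(w in doc for w in ("der", "die", "das"))

def length_ok(doc: str) -> bool:
    # Drop very short fragments.
    return len(doc.split()) >= 5

def not_repetitive(doc: str) -> bool:
    # Crude boilerplate check: require some lexical variety.
    words = doc.split()
    return len(set(words)) / len(words) > 0.5

STAGES = [language_ok, length_ok, not_repetitive]

def filter_corpus(docs):
    """Keep only documents passing every stage, cheapest stage first."""
    return [d for d in docs if all(stage(d) for stage in STAGES)]

docs = [
    "die Katze sitzt auf der Matte und schläft",  # passes all stages
    "buy buy buy buy buy buy",                    # fails language check
    "kurz",                                       # fails language/length
]
print(filter_corpus(docs))  # only the first document survives
```

Ordering the stages from cheapest to most expensive is the usual design choice at web scale, since most of the 500 million documents are rejected early.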

Entities

Institutions

  • arXiv
