ARTFEED — Contemporary Art Intelligence

Study Challenges Data Filtering for Large Model Pretraining

ai-technology · 2026-05-20

A new scaling study of large model pretraining in the high-compute, data-scarce regime suggests that data filtering may be counterproductive. Contrary to the common belief that high-quality data is essential, the research finds that sufficiently trained large-parameter models benefit from low-quality and distractor data, and that the best filter is no filter at all.

Key facts

  • The study investigates data filtering for large model pretraining.
  • It targets the high-compute, data-scarce regime.
  • Common belief holds that filtering to high-quality data is essential.
  • Experiments suggest that, with enough compute, applying no data filter works best.
  • Large-parameter models tolerate, and even benefit from, low-quality data.

Entities

Institutions

  • arXiv

Sources