Study Challenges Data Filtering for Large Model Pretraining

ai-technology · 2026-05-20

A new scaling study of large model pretraining in the high-compute, data-scarce regime suggests that data filtering may be counterproductive. Contrary to the common belief that high-quality data is essential, the research finds that large-parameter models trained with sufficient compute benefit from low-quality and distractor data, and that the best filter is no filter at all.

Key facts

The study investigates data filtering for large model pretraining.
It targets the high-compute, data-scarce regime.
Common belief holds that filtering to high-quality data is essential.
Experiments suggest that, with enough compute, applying no data filter works best.
Large-parameter models both tolerate and benefit from low-quality data.
