Study Challenges Data Filtering for Large Model Pretraining
A new scaling study of large model pretraining in the high-compute, data-scarce regime suggests that data filtering may be counterproductive. Contrary to the common belief that high-quality data is essential, the research finds that sufficiently trained large-parameter models benefit from low-quality and distractor data, and that the best filter is no filter at all.
Key facts
- The study investigates data filtering for large model pretraining.
- It targets the high-compute, data-scarce regime.
- The common belief holds that filtering down to high-quality data is essential.
- Experiments suggest that, with enough compute, applying no data filter works best.
- Sufficiently trained large-parameter models tolerate, and even benefit from, low-quality and distractor data.
Entities
Institutions
- arXiv