Cross-Lingual Quality Classifiers for Multilingual Pretraining Data Selection
A recent arXiv preprint (ID 2604.20549) examines cross-lingual quality classifiers for selecting multilingual pretraining data for Large Language Models (LLMs). The research investigates whether quality signals in embedding spaces remain consistent across languages, so that high-resource languages can help filter data for low-resource ones. The study evaluated several methods, including cross-lingual transfer, third-quartile (Q3) sampling, and retention-rate tuning. The findings indicate that massive multilingual pooling frequently surpasses monolingual baselines in rank stability and overall accuracy for a 1-billion-parameter model trained on 103 billion tokens. For French, a high-resource language, aggregate normalized accuracy rose by 1.2%. The approach addresses the scarcity of high-quality native data needed to build robust quality classifiers across languages, shifting data curation away from sheer volume and toward signal-to-noise ratio.
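The third-quartile (Q3) sampling idea mentioned above can be sketched in a few lines: given per-document quality scores (however they are produced), keep only the documents at or above the pool's 75th percentile. This is an illustrative sketch, not the paper's implementation; the scores here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical quality scores for a pool of candidate documents,
# e.g. as produced by a classifier over multilingual embeddings.
scores = rng.normal(loc=0.0, scale=1.0, size=10_000)

# Q3 sampling: keep only documents whose quality score falls at or
# above the 75th percentile (third quartile) of the pool.
q3 = np.quantile(scores, 0.75)
selected = scores[scores >= q3]

print(f"Q3 threshold: {q3:.3f}")
print(f"kept {selected.size} of {scores.size} documents "
      f"({selected.size / scores.size:.0%})")
```

By construction this retains roughly the top quarter of the pool; the actual paper may combine this cutoff with other criteria.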
Key facts
- arXiv ID: 2604.20549
- Investigates cross-lingual quality classifiers for LLM pretraining data selection
- Evaluates cross-lingual transfer, Q3 sampling, and retention rate tuning
- Massive multilingual pooling outperforms monolingual baselines
- 1B parameter model trained on 103B tokens
- French aggregate normalized accuracy increased by 1.2%
- Addresses insufficient native high-quality data for low-resource languages
- Shifts data curation from volume to signal-to-noise ratio
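The retention-rate tuning listed above amounts to sweeping how much of the scored pool to keep and observing the quality/quantity trade-off. A minimal toy sketch, assuming classifier scores are a noisy proxy for a hidden "true" document quality (all numbers here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: each document has a hidden true quality; the classifier
# score is a noisy proxy that loosely tracks it.
n = 5_000
true_quality = rng.normal(size=n)
scores = true_quality + rng.normal(scale=0.8, size=n)

def kept_pool_quality(retention_rate: float) -> float:
    """Mean hidden true quality of the pool kept at a given retention rate."""
    cutoff = np.quantile(scores, 1.0 - retention_rate)
    kept = true_quality[scores >= cutoff]
    return float(kept.mean())

# Sweep candidate retention rates: stricter filtering yields a
# higher-quality but smaller pool, which is the trade-off being tuned.
for rate in (0.10, 0.25, 0.50, 0.75):
    print(f"retention {rate:.0%}: "
          f"mean true quality {kept_pool_quality(rate):+.3f}")
```

The sweep makes the tension explicit: lower retention rates raise average pool quality at the cost of training tokens, which is why the rate is a tunable hyperparameter rather than a fixed constant.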
Entities
Institutions
- arXiv