ARTFEED — Contemporary Art Intelligence

Cross-Lingual Quality Classifiers for Multilingual Pretraining Data Selection

ai-technology · 2026-04-24

A recent arXiv preprint (arXiv:2604.20549) examines cross-lingual quality classifiers for selecting multilingual pretraining data for Large Language Models (LLMs). The research investigates whether quality signals in embedding space remain consistent across languages, allowing high-resource languages to help filter data for low-resource ones. The study evaluated several methods, including cross-lingual transfer, third-quartile (Q3) sampling, and retention-rate tuning. Findings indicate that massive multilingual pooling frequently surpasses monolingual baselines in rank stability and overall accuracy for a 1-billion-parameter model trained on 103 billion tokens. For French, a high-resource language, aggregate normalized accuracy rose by 1.2%. The approach addresses the scarcity of high-quality native data needed to train robust quality classifiers across languages, shifting the emphasis of data curation from sheer volume to signal-to-noise ratio.
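
In practice, the idea can be sketched as: embed documents with a shared multilingual encoder, fit a quality classifier on labeled documents from a high-resource language, and reuse that classifier to score documents in other languages. The snippet below is a minimal, hypothetical illustration of that cross-lingual transfer setup; the encoder choice, the toy documents, and the labels are assumptions for the example, not the paper's actual pipeline.

    # Minimal sketch of cross-lingual quality-classifier transfer (illustrative only).
    # Assumptions: the multilingual encoder and toy in-memory data stand in for the
    # paper's actual embedding model, quality labels, and pretraining corpora.
    from sentence_transformers import SentenceTransformer
    from sklearn.linear_model import LogisticRegression

    # Shared multilingual embedding space (model choice is an assumption).
    encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    # High-resource (e.g. English) documents with quality labels: 1 = keep, 0 = discard.
    en_docs = ["A well-edited encyclopedia article about photosynthesis.",
               "click here buy now!!! best deals $$$"]
    en_labels = [1, 0]

    # Train the quality classifier on high-resource embeddings only.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(encoder.encode(en_docs), en_labels)

    # Score unlabeled documents in another language (e.g. French) with the same
    # classifier: if quality signals are consistent across the embedding space,
    # the scores remain meaningful without any native French labels.
    fr_docs = ["Un article encyclopédique soigné sur la photosynthèse.",
               "cliquez ici meilleures offres $$$ achetez maintenant"]
    fr_scores = clf.predict_proba(encoder.encode(fr_docs))[:, 1]
    print(fr_scores)  # higher score = higher estimated quality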

Key facts

  • arXiv ID: 2604.20549
  • Investigates cross-lingual quality classifiers for LLM pretraining data selection
  • Evaluates cross-lingual transfer, Q3 sampling, and retention rate tuning (see the selection sketch after this list)
  • Massive multilingual pooling outperforms monolingual baselines
  • 1B parameter model trained on 103B tokens
  • French aggregate normalized accuracy increased by 1.2%
  • Addresses insufficient native high-quality data for low-resource languages
  • Shifts data curation from volume to signal-to-noise ratio
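
The selection step itself reduces to thresholding classifier scores: Q3 sampling keeps documents above the third quartile (top 25%) of the score distribution, while retention-rate tuning treats the kept fraction as a hyperparameter. A minimal sketch, assuming per-document quality scores have already been computed by a classifier such as the one above:

    # Minimal sketch of Q3 sampling vs. a tuned retention rate (illustrative only).
    # Assumption: `scores` are per-document quality scores from a classifier as above.
    import numpy as np

    rng = np.random.default_rng(0)
    scores = rng.random(1_000)  # stand-in for real classifier scores

    # Q3 sampling: keep documents scoring above the third quartile (top 25%).
    q3 = np.quantile(scores, 0.75)
    keep_q3 = scores >= q3

    # Retention-rate tuning: keep the top `retention_rate` fraction of documents,
    # with the rate treated as a tunable hyperparameter (value assumed here).
    retention_rate = 0.4
    cutoff = np.quantile(scores, 1.0 - retention_rate)
    keep_tuned = scores >= cutoff

    print(keep_q3.sum(), keep_tuned.sum())  # documents retained by each rule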

Entities

Institutions

  • arXiv

Sources