Pretraining Data Determines LLM Loss-to-Loss Scaling
A study finds that pretraining data is the primary factor shaping loss-to-loss scaling in large language models (LLMs), while model size, optimization hyperparameters, tokenizers, and architectural differences have limited impact. The research, published on arXiv (2502.12120v3), compared transformer-based models such as Llama with state-space models such as Mamba. The findings suggest that practitioners should prioritize curating suitable pretraining datasets for downstream performance, since architectures and other settings can be tuned freely without significantly affecting the scaling trends.
Key facts
- Pretraining data determines loss-to-loss scaling trends.
- Model size, optimization hyperparameters, tokenizers, and architectural differences have limited impact.
- Study compared Llama (transformer) and Mamba (state-space) models.
- Published on arXiv with ID 2502.12120v3.
- Loss-to-loss scaling describes how a model's loss on one dataset relates to its loss on another, across pretraining datasets and downstream tasks.
- Scaling laws guide optimal balance of model size, tokens, and compute.
- Practitioners should carefully curate pretraining datasets.
- Architectures and other settings can be tuned freely without significantly affecting the scaling trends.
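
To make the idea of a loss-to-loss relationship concrete, the sketch below fits a shifted power law between a pretraining loss and a downstream loss. It is an illustration only, not the study's code: the functional form, the function names `fit_loss_to_loss` and `predict_downstream`, the assumption that the irreducible losses `e_train` and `e_down` are already known, and all numbers are hypothetical choices made for this example.

```python
import numpy as np

def fit_loss_to_loss(train_loss, downstream_loss, e_train=0.0, e_down=0.0):
    """Fit downstream - e_down = K * (train - e_train) ** kappa.

    Assumes the irreducible losses e_train and e_down are given, so the
    fit reduces to a straight line in log-log space.
    """
    x = np.log(np.asarray(train_loss, dtype=float) - e_train)
    y = np.log(np.asarray(downstream_loss, dtype=float) - e_down)
    kappa, log_k = np.polyfit(x, y, 1)  # slope = kappa, intercept = log K
    return float(np.exp(log_k)), float(kappa)

def predict_downstream(train_loss, K, kappa, e_train=0.0, e_down=0.0):
    """Map a pretraining loss to a predicted downstream loss."""
    shifted = np.asarray(train_loss, dtype=float) - e_train
    return K * shifted ** kappa + e_down

# Synthetic (made-up) losses for five hypothetical checkpoints.
train = np.array([3.2, 3.0, 2.8, 2.6, 2.4])
down = 0.5 * (train - 1.5) ** 1.3 + 0.8

K, kappa = fit_loss_to_loss(train, down, e_train=1.5, e_down=0.8)
predicted = predict_downstream(2.2, K, kappa, e_train=1.5, e_down=0.8)
print(K, kappa, float(predicted))
```

Under this reading, fitting one such curve per pretraining dataset and comparing the curves is a simple way to see the reported effect: changing the data would move the curve, while changing architecture or tokenizer would leave points close to the same curve.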
Entities
Institutions
- arXiv