Optimizing Language Mixture Ratio for Llama-3 Continual Pre-Training
A new arXiv paper (2409.06624) investigates how to choose the Additional Language Mixture Ratio (ALMR) and Learning Rate (LR) for Continual Pre-Training (CPT) of Large Language Models (LLMs), with the goal of enhancing Chinese language ability. The study performs CPT on Llama-3 8B and 70B models and, from experiments at the 8B scale, establishes a correlation between ALMR and LR that directly indicates the optimal experimental setup. With the tuned hyper-parameters and subsequent fine-tuning, model performance improves on Chinese-related benchmarks as well as on specific downstream domains. The work addresses the gap between small-scale scaling-law experiments and full-size model deployment, providing systematic guidance for selecting CPT hyper-parameters.
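To make the ALMR concept concrete, here is a minimal sketch of mixing an additional-language corpus into the original pre-training data at a given ratio. The function name `build_cpt_mixture` and the document-level sampling are illustrative assumptions, not the paper's actual data pipeline, which would operate on token counts at far larger scale.

```python
import random

def build_cpt_mixture(base_docs, additional_docs, almr, n_docs, seed=0):
    """Sample n_docs documents for a CPT batch in which roughly an `almr`
    fraction comes from the additional-language (here, Chinese) corpus.
    Document-level sampling is a simplification of a token-level ratio."""
    rng = random.Random(seed)
    n_additional = round(almr * n_docs)
    mix = rng.choices(additional_docs, k=n_additional)
    mix += rng.choices(base_docs, k=n_docs - n_additional)
    rng.shuffle(mix)
    return mix

# Example: a mix with ALMR = 0.3, i.e. roughly 30% Chinese documents.
english = [f"en-doc-{i}" for i in range(1000)]
chinese = [f"zh-doc-{i}" for i in range(1000)]
print(build_cpt_mixture(english, chinese, almr=0.3, n_docs=10))
```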
Key facts
- Paper arXiv:2409.06624v4
- Focuses on Continual Pre-Training (CPT) for Llama-3 8B and 70B
- Enhances Chinese language ability
- Studies optimal Additional Language Mixture Ratio (ALMR) and Learning Rate (LR)
- Bridges the gap between experimental scaling laws and full-size model deployment
- Improves performance on Chinese-related benchmarks
- Involves hyper-parameter tuning followed by fine-tuning (a sketch of such a sweep follows this list)
- Published on arXiv
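The ALMR-LR correlation is established empirically at the 8B scale and then carried over to the 70B run. Below is a hedged sketch of how such a sweep could be organized; the candidate grids and the stub evaluation function are hypothetical placeholders, not values or code from the paper.

```python
from itertools import product

# Hypothetical candidate grids; the paper's actual values are not reproduced here.
almr_grid = [0.1, 0.2, 0.3, 0.4]   # fraction of Chinese data in the CPT mix
lr_grid = [1e-5, 3e-5, 1e-4]       # peak learning rate for CPT

def evaluate_cpt_run(almr, lr):
    """Placeholder: run CPT on the 8B model with this (ALMR, LR) pair and
    return a score on Chinese-related benchmarks. Supply your own pipeline."""
    raise NotImplementedError

for almr, lr in product(almr_grid, lr_grid):
    print(f"planned 8B CPT run: ALMR={almr:.2f}, peak LR={lr:.0e}")

# Once every pair has been scored, the best (ALMR, LR) found at 8B is the
# setting reused for the full 70B continual pre-training run.
```

Searching at the affordable 8B scale and reusing the chosen (ALMR, LR) pair at 70B is how the work bridges small-scale experiments and full-size deployment.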
Entities
Institutions
- arXiv