Optimizing Language Mixture Ratio for Llama-3 Continual Pre-Training
A new arXiv paper (2409.06624) investigates how to choose the Additional Language Mixture Ratio (ALMR) and Learning Rate (LR) for Continual Pre-Training (CPT) of Large Language Models (LLMs), with the goal of enhancing Chinese language ability. The study performs CPT on Llama-3 8B and 70B models and, from experiments at the 8B scale, establishes a correlation between ALMR and LR that directly indicates the optimal experimental setup. With the tuned hyper-parameters and subsequent fine-tuning, model performance improves on Chinese-related benchmarks as well as on specific downstream domains. The work addresses the gap between small-scale scaling-law experiments and full-size model deployment, providing systematic guidance for selecting CPT hyper-parameters.
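To make the ALMR concept concrete, here is a minimal sketch of mixing an additional-language corpus into the original pre-training data at a given ratio. The function name `build_cpt_mixture` and the document-level sampling are illustrative assumptions, not the paper's actual data pipeline, which would operate on token counts at far larger scale.

```python
import random

def build_cpt_mixture(base_docs, additional_docs, almr, n_docs, seed=0):
    """Sample n_docs documents for a CPT batch in which roughly an `almr`
    fraction comes from the additional-language (here, Chinese) corpus.
    Document-level sampling is a simplification of a token-level ratio."""
    rng = random.Random(seed)
    n_additional = round(almr * n_docs)
    mix = rng.choices(additional_docs, k=n_additional)
    mix += rng.choices(base_docs, k=n_docs - n_additional)
    rng.shuffle(mix)
    return mix

# Example: a mix with ALMR = 0.3, i.e. roughly 30% Chinese documents.
english = [f"en-doc-{i}" for i in range(1000)]
chinese = [f"zh-doc-{i}" for i in range(1000)]
print(build_cpt_mixture(english, chinese, almr=0.3, n_docs=10))
```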
Key facts
- Paper arXiv:2409.06624v4
- Focuses on Continual Pre-Training (CPT) for Llama-3 8B and 70B
- Enhances Chinese language ability
- Studies optimal Additional Language Mixture Ratio (ALMR) and Learning Rate (LR)
- Bridges the gap between experimental scaling laws and full-size model deployment
- Improves performance on Chinese-related benchmarks
- Involves hyper-parameter tuning followed by fine-tuning (a sketch of such a sweep follows this list)
- Published on arXiv
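The ALMR-LR correlation is established empirically at the 8B scale and then carried over to the 70B run. Below is a hedged sketch of how such a sweep could be organized; the candidate grids and the stub evaluation function are hypothetical placeholders, not values or code from the paper.

```python
from itertools import product

# Hypothetical candidate grids; the paper's actual values are not reproduced here.
almr_grid = [0.1, 0.2, 0.3, 0.4]   # fraction of Chinese data in the CPT mix
lr_grid = [1e-5, 3e-5, 1e-4]       # peak learning rate for CPT

def evaluate_cpt_run(almr, lr):
    """Placeholder: run CPT on the 8B model with this (ALMR, LR) pair and
    return a score on Chinese-related benchmarks. Supply your own pipeline."""
    raise NotImplementedError

for almr, lr in product(almr_grid, lr_grid):
    print(f"planned 8B CPT run: ALMR={almr:.2f}, peak LR={lr:.0e}")

# Once every pair has been scored, the best (ALMR, LR) found at 8B is the
# setting reused for the full 70B continual pre-training run.
```

Searching at the affordable 8B scale and reusing the chosen (ALMR, LR) pair at 70B is how the work bridges small-scale experiments and full-size deployment.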
Entities
Institutions
- arXiv