MobileMoE: On-Device MoE Language Models Achieve New Pareto Frontier
MobileMoE is a family of on-device Mixture-of-Experts (MoE) language models with 0.3-0.9B active parameters and 1.3-5.3B total parameters, which the authors report as a new Pareto frontier for on-device LLMs. The study derives a scaling law for on-device MoE that guides architecture choices under mobile memory and compute limits; it identifies moderate sparsity combined with fine-grained and shared experts as the best configuration. Training follows a four-phase recipe (pre-training, mid-training, instruction fine-tuning, and quantization-aware training) that uses only open-source datasets. Across 14 benchmarks, MobileMoE matches or surpasses the performance of current models.
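The gap between active and total parameters comes from routing: each token passes through only a few routed experts plus the always-on shared experts. The sketch below counts feed-forward parameters for a generic MoE layer to make that distinction concrete. It is illustrative only: the class `MoEFFNConfig`, its field names, and the gated three-matrix expert layout are assumptions, not the MobileMoE architecture, and attention and embedding parameters are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEFFNConfig:
    """Hypothetical MoE feed-forward configuration (not taken from the paper)."""
    d_model: int   # model (residual stream) width
    d_expert: int  # hidden width of one fine-grained expert
    n_routed: int  # routed experts per layer
    n_shared: int  # shared experts that every token uses
    top_k: int     # routed experts selected per token
    n_layers: int  # number of MoE layers


def expert_params(cfg: MoEFFNConfig) -> int:
    # Assumption: a gated FFN with three d_model x d_expert matrices, no biases.
    return 3 * cfg.d_model * cfg.d_expert


def router_params(cfg: MoEFFNConfig) -> int:
    # A linear router that scores every routed expert for each token.
    return cfg.d_model * cfg.n_routed


def total_ffn_params(cfg: MoEFFNConfig) -> int:
    # Everything that must be stored in memory.
    per_layer = (cfg.n_routed + cfg.n_shared) * expert_params(cfg) + router_params(cfg)
    return cfg.n_layers * per_layer


def active_ffn_params(cfg: MoEFFNConfig) -> int:
    # Everything a single token actually computes with.
    per_layer = (cfg.top_k + cfg.n_shared) * expert_params(cfg) + router_params(cfg)
    return cfg.n_layers * per_layer


if __name__ == "__main__":
    # Placeholder sizes chosen for illustration, not a MobileMoE configuration.
    cfg = MoEFFNConfig(d_model=1024, d_expert=512, n_routed=32,
                       n_shared=1, top_k=4, n_layers=16)
    total, active = total_ffn_params(cfg), active_ffn_params(cfg)
    print(f"total FFN params:  {total:,}")
    print(f"active FFN params: {active:,}")
    print(f"active fraction:   {active / total:.2%}")
```

In this toy model, total parameters drive memory footprint while active parameters drive per-token compute, which is why an on-device design has to trade the two against each other.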
Key facts
- MobileMoE models have 0.3-0.9B active parameters and 1.3-5.3B total parameters.
- The scaling law guides the choice of MoE architecture under mobile memory and compute constraints.
- The optimal configuration uses moderate sparsity with fine-grained and shared experts.
- Training includes pre-training, mid-training, instruction fine-tuning, and quantization-aware training.
- All training data is from open-source datasets.
- MobileMoE is evaluated on 14 benchmarks.
- The models establish a new Pareto frontier for on-device LLMs.
- The work is published on arXiv (2605.27358).
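To make the Pareto-frontier claim concrete: a model is on the frontier if no other model is at least as cheap and at least as accurate while being strictly better on one of the two. The following is a minimal sketch of that check, assuming each model is summarized by one cost (for example, active parameters in billions) and one score (for example, a benchmark average). The function name `pareto_frontier` and the sample entries are hypothetical and are not results from the paper.

```python
from typing import List, Tuple

Point = Tuple[str, float, float]  # (name, cost, score)


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, where lower cost and higher score are better.

    Exact duplicates (same cost and same score) are reported only once.
    """
    frontier: List[Point] = []
    best_score = float("-inf")
    # Sort by cost ascending; among equal costs, visit the highest score first.
    for name, cost, score in sorted(points, key=lambda p: (p[1], -p[2])):
        # A point survives only if it beats every cheaper-or-equal point seen so far.
        if score > best_score:
            frontier.append((name, cost, score))
            best_score = score
    return frontier


if __name__ == "__main__":
    # Made-up placeholder numbers, used only to exercise the function.
    models = [
        ("model-a", 0.3, 40.0),
        ("model-b", 0.5, 38.0),  # dominated by model-a: costlier and less accurate
        ("model-c", 0.9, 50.0),
        ("model-d", 0.9, 45.0),  # dominated by model-c: same cost, lower score
    ]
    for name, cost, score in pareto_frontier(models):
        print(f"{name}: cost={cost}, score={score}")
```

A new frontier in this sense means that the new models push this non-dominated set outward, so that existing models at comparable cost fall below it.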
Entities
Institutions
- arXiv