Marco-MoE: Open Multilingual Sparse MoE Models with Efficient Upcycling
Researchers have introduced Marco-MoE, a suite of fully open multilingual sparse Mixture-of-Experts (MoE) language models that activate only about 5% of their total parameters for each input token. By upcycling from dense models, the suite is pre-trained efficiently on 5 trillion tokens. Marco-MoE outperforms similarly sized competitors on English and multilingual benchmarks, delivering a strong performance-to-compute ratio, and its instruction-tuned variants surpass models with 3–14 times more active parameters. Analysis shows that Marco-MoE develops structured expert activation patterns shared across related languages while maintaining specialized expert utilization for linguistically isolated languages, enabling scalable language expansion without interference.
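To make the sparse-activation claim concrete, the following is a minimal top-k routing sketch in PyTorch. The layer sizes, expert count, and top-k value are illustrative assumptions, not Marco-MoE's published configuration; with 4 of 64 experts run per token, roughly 6% of the layer's expert parameters are used, in the same spirit as the ~5% figure above.

```python
# Minimal sketch of a top-k sparse MoE layer (PyTorch). All sizes here
# (d_model, d_ff, n_experts, top_k) are illustrative assumptions, not the
# actual Marco-MoE configuration.
import torch
import torch.nn as nn


class SparseMoELayer(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, n_experts=64, top_k=4):
        super().__init__()
        self.top_k = top_k
        # The router scores every expert; only the top_k are actually run per token.
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Linear(d_model, d_ff),
                    nn.GELU(),
                    nn.Linear(d_ff, d_model),
                )
                for _ in range(n_experts)
            ]
        )

    def forward(self, x):  # x: (num_tokens, d_model)
        probs = self.router(x).softmax(dim=-1)             # (num_tokens, n_experts)
        weights, idx = torch.topk(probs, self.top_k, -1)   # keep the k best experts per token
        weights = weights / weights.sum(-1, keepdim=True)  # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                   # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


# Per token, only top_k of n_experts expert FFNs run: 4/64 ≈ 6% of this
# layer's expert parameters, analogous to the ~5% activation described above.
```

A production implementation would batch tokens per expert and add an auxiliary load-balancing loss; the double loop here is only for readability.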
Key facts
- Marco-MoE is a suite of open multilingual sparse Mixture-of-Experts models.
- Only about 5% of total parameters are activated per input token.
- Upcycling from dense models enables efficient pre-training on 5T tokens (see the sketch after this list).
- Models surpass similarly-sized competitors on English and multilingual benchmarks.
- Instruct variants outperform models with 3–14× more activated parameters.
- Structured expert activation patterns are shared across related languages.
- Specialized expert utilization is maintained for linguistically isolated languages.
- Scalable language expansion is possible without interference.
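The upcycling point can be illustrated with a short sketch: each expert in the new MoE layer is seeded with a copy of the trained dense FFN's weights, and only the router is freshly initialized, so sparse pre-training continues from a strong starting point rather than from scratch. The helper name, expert count, and initialization below are assumptions for illustration, not the exact Marco-MoE recipe.

```python
# Sketch of upcycling a trained dense FFN into a sparse MoE block. The helper
# name, expert count, and router init are hypothetical; Marco-MoE's exact
# upcycling procedure is not reproduced here.
import copy

import torch.nn as nn


def upcycle_ffn(dense_ffn: nn.Module, d_model: int, n_experts: int = 64) -> nn.ModuleDict:
    """Return an MoE block whose experts all start as copies of dense_ffn."""
    experts = nn.ModuleList([copy.deepcopy(dense_ffn) for _ in range(n_experts)])
    router = nn.Linear(d_model, n_experts, bias=False)
    nn.init.normal_(router.weight, std=0.02)  # the router is the only part trained from scratch
    return nn.ModuleDict({"router": router, "experts": experts})


# Example: turn a dense transformer block's FFN into a 64-expert MoE layer.
dense_ffn = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
moe_block = upcycle_ffn(dense_ffn, d_model=1024)
```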
Entities
—