ML-Embed: Efficient Multilingual Text Embeddings
A new research paper introduces ML-Embed, a suite of multilingual text embedding models designed to address three barriers: prohibitive computational costs, narrow linguistic focus, and lack of transparency. The models are built on a 3-Dimensional Matryoshka Learning (3D-ML) framework, which combines Matryoshka Representation Learning (MRL) for storage efficiency, Matryoshka Layer Learning (MLL) for flexible inference, and a newly introduced Matryoshka Embedding Learning (MEL) for parameter efficiency. The authors curated a massively multilingual dataset to train the models, aiming to make embeddings more inclusive and efficient across a wide range of languages.
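The storage-efficiency dimension rests on the general Matryoshka Representation Learning idea: the first k coordinates of a trained embedding form a usable lower-dimensional embedding on their own. The paper's exact training recipe is not detailed here, so the sketch below only illustrates that general truncate-and-renormalize pattern; the function name and toy vector are illustrative, not from the paper.

```python
import numpy as np

def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates of a Matryoshka-style embedding
    and re-normalize, so cosine similarity remains meaningful.
    (Illustrative helper; not an API from the ML-Embed paper.)"""
    sub = np.asarray(vec, dtype=np.float64)[:dim]
    norm = np.linalg.norm(sub)
    return sub / norm if norm > 0 else sub

# Toy example: an 8-dimensional "full" embedding truncated to 4 dimensions.
full = np.array([0.4, 0.3, 0.2, 0.1, 0.05, 0.05, 0.02, 0.01])
small = truncate_embedding(full, 4)
print(small.shape)  # (4,)
```

Because each prefix of the vector is trained to stand alone, an index can store and compare 4- or 8-dimensional prefixes instead of the full embedding, trading a little accuracy for much lower storage and compute.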
Key facts
- ML-Embed is a suite of inclusive and efficient text embedding models.
- The models address prohibitive computational costs, narrow linguistic focus, and lack of transparency.
- The framework is called 3-Dimensional Matryoshka Learning (3D-ML).
- 3D-ML includes MRL, MLL, and the newly introduced MEL.
- MEL enhances parameter efficiency.
- A massively multilingual dataset was curated for training.
- The paper is available on arXiv with ID 2605.15081.
- The research aims to democratize high-quality embeddings for many languages.