ML-Embed: Efficient Multilingual Text Embeddings
A new research paper introduces ML-Embed, a suite of multilingual text embedding models designed to address three barriers: prohibitive computational costs, narrow linguistic focus, and lack of transparency. The models are built on a 3-Dimensional Matryoshka Learning (3D-ML) framework, which combines Matryoshka Representation Learning (MRL) for storage efficiency, Matryoshka Layer Learning (MLL) for flexible inference, and a newly introduced Matryoshka Embedding Learning (MEL) for parameter efficiency. The authors curated a massively multilingual dataset to train the models, aiming to make embeddings more inclusive and efficient across a wide range of languages.
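The storage-efficiency dimension rests on the general Matryoshka Representation Learning idea: the first k coordinates of a trained embedding form a usable lower-dimensional embedding on their own. The paper's exact training recipe is not detailed here, so the sketch below only illustrates that general truncate-and-renormalize pattern; the function name and toy vector are illustrative, not from the paper.

```python
import numpy as np

def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates of a Matryoshka-style embedding
    and re-normalize, so cosine similarity remains meaningful.
    (Illustrative helper; not an API from the ML-Embed paper.)"""
    sub = np.asarray(vec, dtype=np.float64)[:dim]
    norm = np.linalg.norm(sub)
    return sub / norm if norm > 0 else sub

# Toy example: an 8-dimensional "full" embedding truncated to 4 dimensions.
full = np.array([0.4, 0.3, 0.2, 0.1, 0.05, 0.05, 0.02, 0.01])
small = truncate_embedding(full, 4)
print(small.shape)  # (4,)
```

Because each prefix of the vector is trained to stand alone, an index can store and compare 4- or 8-dimensional prefixes instead of the full embedding, trading a little accuracy for much lower storage and compute.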
Key facts
- ML-Embed is a suite of inclusive and efficient text embedding models.
- The models address prohibitive computational costs, narrow linguistic focus, and lack of transparency.
- The framework is called 3-Dimensional Matryoshka Learning (3D-ML).
- 3D-ML includes MRL, MLL, and the newly introduced MEL.
- MEL enhances parameter efficiency.
- A massively multilingual dataset was curated for training.
- The paper is available on arXiv with ID 2605.15081.
- The research aims to democratize high-quality embeddings for many languages.