MONET: Open-Source Dataset of 104.9M Image-Text Pairs Released
MONET is a newly released open dataset, licensed under Apache 2.0, of about 104.9 million image-text pairs curated from 2.9 billion raw pairs (roughly 3.6% of the raw pool) collected from heterogeneous open sources. The curation pipeline applies several safety and domain filters, removes exact and near-duplicates, and re-captions the images with multiple vision-language models, producing descriptions that range from short to long. The dataset is also augmented with synthetically generated samples, and each image ships with pre-computed embeddings and annotations to support downstream use. To validate MONET, a 4-billion-parameter latent diffusion model was trained solely on this dataset and achieved competitive GenEval and DPG scores. The release is intended to support open, reproducible research in text-to-image generation.
Key facts
- MONET dataset contains ~104.9M image-text pairs
- Sourced from 2.9B raw pairs across heterogeneous open sources
- Processed with safety filtering, domain filtering, exact and near-duplicate removal, and re-captioning
- Re-captioned with multiple vision-language models
- Augmented with synthetically generated samples
- Each image has pre-computed embeddings and annotations
- A 4B-parameter latent diffusion model trained on MONET achieved competitive GenEval and DPG scores
- Dataset released under Apache 2.0 license
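
The summary does not say how MONET removes exact and near-duplicates, so the following is only a minimal sketch of one common approach, not the dataset's actual pipeline. It assumes exact duplicates are caught by hashing image bytes and near-duplicates by cosine similarity between image embeddings. The record fields (`id`, `image_bytes`, `embedding`), the function names, and the 0.95 threshold are all hypothetical choices made for illustration.

```python
import hashlib
import math


def exact_dedup(records):
    """Drop records whose image bytes hash to an already-seen SHA-256 digest."""
    seen = set()
    kept = []
    for rec in records:
        digest = hashlib.sha256(rec["image_bytes"]).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(rec)
    return kept


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def near_dedup(records, threshold=0.95):
    """Greedy near-duplicate removal: keep a record only if its embedding is
    below `threshold` similarity to every record kept so far.
    The threshold is an assumed value, not one taken from MONET."""
    kept = []
    for rec in records:
        if all(cosine(rec["embedding"], k["embedding"]) < threshold for k in kept):
            kept.append(rec)
    return kept


# Toy records with hypothetical fields; real embeddings would be much larger.
samples = [
    {"id": "a", "image_bytes": b"\x01\x02", "embedding": [1.0, 0.0]},
    {"id": "b", "image_bytes": b"\x01\x02", "embedding": [1.0, 0.0]},    # exact copy of a
    {"id": "c", "image_bytes": b"\x01\x03", "embedding": [0.99, 0.05]},  # near copy of a
    {"id": "d", "image_bytes": b"\x09\x09", "embedding": [0.0, 1.0]},    # distinct
]

unique = near_dedup(exact_dedup(samples))
print([r["id"] for r in unique])  # ['a', 'd']
```

The pairwise comparison here is quadratic in the number of records, so it only suits small examples. At the scale of billions of raw pairs, a pipeline would more plausibly rely on approximate nearest-neighbor search or locality-sensitive hashing.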