MONET: Open-Source Dataset of 104.9M Image-Text Pairs Released

digital · 2026-05-22

MONET is a newly released open dataset of about 104.9 million image-text pairs, distributed under the Apache 2.0 license. The pairs were curated from 2.9 billion raw pairs collected from diverse open sources. Curation involved several stages of safety and domain filtering, plus the removal of exact and near-duplicates. The images were re-captioned with multiple vision-language models, yielding descriptions that range from short to long, and the collection was augmented with synthetically generated samples. Each image comes with pre-computed embeddings and annotations to support downstream applications. As a validation, a 4-billion-parameter latent diffusion model trained solely on MONET achieved competitive GenEval and DPG scores. The release is intended to support open and reproducible research in text-to-image generation.

Key facts

MONET dataset contains ~104.9M image-text pairs
Sourced from 2.9B raw pairs across heterogeneous open sources
Curated through safety filtering, domain filtering, exact and near-duplicate removal, and re-captioning
Re-captioned with multiple vision-language models
Augmented with synthetically generated samples
Each image has pre-computed embeddings and annotations
A 4B-parameter latent diffusion model trained on MONET achieved competitive GenEval and DPG scores
Dataset released under Apache 2.0 license
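
For a sense of scale, the figures above imply that roughly 3.6% of the raw pairs survive curation. The Python sketch below computes that retention rate and illustrates two of the listed steps (safety filtering and exact deduplication) on toy data. It is not MONET's actual pipeline: the `Pair` class, its `unsafe` flag, and the `curate` function are hypothetical names chosen for illustration, and near-duplicate removal (which usually relies on perceptual hashes or embedding similarity) is not shown.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    """Hypothetical record for one raw image-text pair."""
    image_bytes: bytes
    caption: str
    unsafe: bool = False  # assumed output of an upstream safety classifier


def curate(pairs):
    """Drop pairs flagged unsafe, then keep the first copy of each exact image."""
    seen = set()
    kept = []
    for pair in pairs:
        if pair.unsafe:
            continue
        # Exact deduplication: identical bytes produce an identical digest.
        digest = hashlib.sha256(pair.image_bytes).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(pair)
    return kept


def retention_rate(kept_count, raw_count):
    """Fraction of raw pairs that remain after curation."""
    return kept_count / raw_count


toy_raw = [
    Pair(b"img-a", "a cat"),
    Pair(b"img-a", "cat photo"),           # exact duplicate image
    Pair(b"img-b", "a dog", unsafe=True),  # removed by the safety filter
    Pair(b"img-c", "a tree"),
]
toy_curated = curate(toy_raw)

# Reported figures: about 104.9M curated pairs out of 2.9B raw pairs.
monet_rate = retention_rate(104_900_000, 2_900_000_000)
print(f"toy pairs kept: {len(toy_curated)} of {len(toy_raw)}")
print(f"MONET retention: {monet_rate:.2%}")
```

Hashing raw bytes only catches byte-identical images; a resized or re-encoded copy would pass through, which is why a separate near-duplicate stage is needed in practice.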

Sources

arXiv cs.AI — 2026-05-21