Minos: A Multimodal Evaluation Model for Image-Text Generation

other · 2026-04-30

Researchers have developed Minos, a multimodal evaluation model that assesses both image-to-text (I2T) and text-to-image (T2I) generation tasks. The model is trained on a new dataset, Minos-57K, which comprises 57,000 evaluation samples across 15 datasets and was constructed with rigorous quality control strategies. Trained with supervised fine-tuning (SFT) and preference alignment, Minos achieves strong performance while using less than half the training data of prior work. The research addresses two problems: the limitations of traditional multimodal evaluation metrics, and the inconsistent performance of existing evaluation models across I2T and T2I tasks. The paper is available on arXiv under identifier 2506.02494.

Key facts

Minos is a multimodal evaluation model for I2T and T2I tasks.
Trained on the Minos-57K dataset, with 57,000 samples across 15 datasets.
Trained with supervised fine-tuning (SFT) and preference alignment.
Uses less than half the training data of prior work.
Addresses limitations of traditional multimodal evaluation metrics.
Paper available on arXiv: 2506.02494.
