Multimodal Distribution Matching for Vision-Language Dataset Distillation

ai-technology · 2026-05-25

A team of researchers has introduced Multimodal Distribution Matching (MDM), a geometry-aware framework for distilling large vision-language datasets into small synthetic sets. MDM addresses two weaknesses of earlier approaches, high computational cost and neglect of cross-modal relationships, by combining data-level, model-level, and loss-level components. At the data level, it initializes synthetic image-text pairs by sampling from clusters in a joint embedding space. At the model level, it forms a mixed teacher by interpolating fine-tuned models in weight space according to their angular deviation from a pretrained anchor. The goal is to preserve representation quality and cross-modal alignment under tight compute and memory budgets.
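
To make the data-level step concrete, the following is a minimal sketch of cluster-based initialization, not the authors' implementation. It assumes the joint embedding is the concatenation of L2-normalized image and text embeddings, that the clustering is plain k-means, and that one real pair is drawn per cluster; the summary specifies none of these details. All function names (joint_embedding, kmeans_assign, init_synthetic_pairs) are hypothetical.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def joint_embedding(img_emb, txt_emb):
    # Assumption: the joint space is the concatenation of the two
    # normalized embeddings. The paper may define it differently.
    return l2_normalize(img_emb) + l2_normalize(txt_emb)


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda c: sq_dist(point, centers[c]))


def kmeans_assign(points, k, iters=20, seed=0):
    """Plain k-means (an assumed choice); returns a cluster id per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        assign = [nearest(p, centers) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return [nearest(p, centers) for p in points]


def init_synthetic_pairs(img_embs, txt_embs, k, seed=0):
    """Return indices of real pairs used to initialize the synthetic set,
    one sampled uniformly from each non-empty cluster."""
    points = [joint_embedding(i, t) for i, t in zip(img_embs, txt_embs)]
    assign = kmeans_assign(points, k, seed=seed)
    rng = random.Random(seed)
    chosen = []
    for c in range(k):
        members = [idx for idx, a in enumerate(assign) if a == c]
        if members:
            chosen.append(rng.choice(members))
    return chosen


# Toy usage: six image-text pairs with 2-d embeddings, distilled to at most 2.
imgs = [[1.0, 0.1], [0.9, 0.0], [1.0, 0.2], [0.0, 1.0], [0.1, 0.9], [0.2, 1.0]]
txts = [[1.0, 0.0], [1.0, 0.1], [0.8, 0.1], [0.1, 1.0], [0.0, 1.0], [0.1, 0.8]]
selected = init_synthetic_pairs(imgs, txts, k=2)
```

Sampling from clusters rather than uniformly at random is meant to spread the small synthetic budget across distinct regions of the joint space, so that image and text stay paired from the start.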

Key facts

arXiv:2605.23482v1
Multimodal Distribution Matching (MDM) is proposed for vision-language dataset distillation
MDM is a geometry-aware framework
It integrates data, model, and loss-level components
Data level: initializes synthetic pairs by sampling from clusters in joint embedding space
Model level: forms mixed teacher by interpolating fine-tuned models in weight space
Interpolation based on angular deviation from pretrained anchor
Aims to preserve representation quality and cross-modal alignment under tight budgets
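
The model-level idea can be illustrated with a short sketch, under stated assumptions. The summary does not say how angular deviation is turned into interpolation coefficients, so the rule below (a softmax over negative angles, giving more weight to models that stay closer in direction to the anchor) is purely hypothetical, as are the names angle_between, mixing_weights, mix_teacher and the temperature parameter. Model weights are represented as flat lists of floats and are assumed to be non-zero.

```python
import math


def angle_between(u, v):
    """Angle in radians between two non-zero flattened weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (nu * nv)))  # clamp against rounding error
    return math.acos(cos)


def mixing_weights(anchor, finetuned, temperature=1.0):
    # Hypothetical rule: softmax over negative angular deviation, so a
    # smaller angle to the pretrained anchor yields a larger coefficient.
    angles = [angle_between(anchor, w) for w in finetuned]
    scores = [math.exp(-a / temperature) for a in angles]
    total = sum(scores)
    return [s / total for s in scores]


def mix_teacher(anchor, finetuned, temperature=1.0):
    """Convex combination of fine-tuned weights (interpolation in weight space)."""
    coeffs = mixing_weights(anchor, finetuned, temperature)
    return [
        sum(c * w[j] for c, w in zip(coeffs, finetuned))
        for j in range(len(anchor))
    ]


# Toy usage: a 2-parameter anchor and two fine-tuned variants.
anchor = [1.0, 0.0]
finetuned = [[1.0, 0.1], [1.0, 1.0]]
coeffs = mixing_weights(anchor, finetuned)
teacher = mix_teacher(anchor, finetuned)
```

Because the coefficients are non-negative and sum to one, the mixed teacher stays inside the convex hull of the fine-tuned models; the opposite weighting, favoring larger deviations, would be equally easy to express and may be closer to what the paper does.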
