Multimodal Distribution Matching for Vision-Language Dataset Distillation
Researchers have introduced Multimodal Distribution Matching (MDM), a geometry-aware framework for distilling large vision-language datasets into small synthetic sets. MDM targets two weaknesses of earlier methods, high computational cost and neglect of cross-modal relationships, by combining components at the data, model, and loss levels. At the data level, it initializes synthetic image-text pairs by sampling from clusters in a joint embedding space. At the model level, it forms a mixed teacher by interpolating fine-tuned models in weight space, guided by their angular deviation from a pretrained anchor. The goal is to preserve representation quality and cross-modal alignment under tight compute and memory budgets.
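The data-level step can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the joint embedding is the concatenation of precomputed image and text embeddings, uses plain k-means as the clustering method, and draws one real pair per cluster as the starting point for a synthetic pair. The function names (`kmeans`, `init_synthetic_pairs`) are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def init_synthetic_pairs(image_emb, text_emb, k, seed=0):
    """Return indices of real pairs that seed the synthetic set.

    Assumption: the joint embedding is a simple concatenation of the
    image and text embeddings of each pair.
    """
    joint = [list(i) + list(t) for i, t in zip(image_emb, text_emb)]
    labels = kmeans(joint, k, seed=seed)
    rng = random.Random(seed)
    chosen = []
    for c in range(k):
        members = [idx for idx, l in enumerate(labels) if l == c]
        if members:  # empty clusters contribute no pair
            chosen.append(rng.choice(members))
    return chosen
```

Sampling per cluster, rather than uniformly over the dataset, is one way to make a small synthetic set cover the distinct regions of the joint space. The returned indices point at real image-text pairs whose contents would initialize the synthetic pairs.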
Key facts
- arXiv:2605.23482v1
- Multimodal Distribution Matching (MDM) is proposed for vision-language dataset distillation
- MDM is a geometry-aware framework
- It integrates data, model, and loss-level components
- Data level: initializes synthetic pairs by sampling from clusters in a joint embedding space
- Model level: forms a mixed teacher by interpolating fine-tuned models in weight space
- Interpolation is based on angular deviation from a pretrained anchor
- Aims to preserve representation quality and cross-modal alignment under tight budgets
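The model-level idea in the list above can be sketched as follows. The summary does not give the exact weighting rule, so everything specific here is an assumption: weights are treated as flat vectors, the angle is measured between each fine-tuned weight vector and the anchor, and coefficients come from a softmax over negative angles, so models that deviate less from the anchor contribute more. The names `mix_teacher` and `temperature` are hypothetical.

```python
import math

def angle_between(u, v):
    """Angle in radians between two flat weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point drift.
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos)

def mix_teacher(anchor, finetuned, temperature=0.1):
    """Interpolate fine-tuned weight vectors into one mixed teacher.

    Assumption: coefficients are a softmax over negative angular
    deviation from the pretrained anchor. This is an illustrative
    choice, not the rule used in the paper.
    """
    angles = [angle_between(w, anchor) for w in finetuned]
    scores = [math.exp(-a / temperature) for a in angles]
    total = sum(scores)
    coeffs = [s / total for s in scores]
    # Convex combination of the fine-tuned weights, coordinate by coordinate.
    mixed = [sum(c * w[j] for c, w in zip(coeffs, finetuned))
             for j in range(len(anchor))]
    return mixed, coeffs, angles
```

Because the coefficients are non-negative and sum to one, the mixed teacher stays inside the convex hull of the fine-tuned models. In practice the same computation would run per parameter tensor or over a flattened state dict; that choice is also left open by the summary.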
Entities
Institutions
- arXiv