MiMIC: A New Method to Fix Visual Modality Collapse in Multimodal Retrieval

publication · 2026-04-25

A new paper on arXiv (2604.21326) introduces MiMIC, a method for addressing visual modality collapse in universal multimodal retrieval (UMR). UMR aims to map different modalities, such as visual and textual data, into a shared embedding space. Existing approaches include early-fusion methods such as Marvel, which projects visual features into the language model space, and late-fusion methods such as UniVL-DR, which uses separate encoders. The paper's pilot study found that Marvel suffers from visual modality collapse: it ignores visual features and relies too heavily on text. UniVL-DR is less affected by collapse but is prone to semantic misalignment, where related content ends up far apart in the embedding space. MiMIC is proposed to mitigate both issues.
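To make the idea of a shared embedding space concrete, here is a minimal sketch of retrieval across modalities by cosine similarity. It is not taken from the paper: the 3-dimensional vectors, the candidate names, and the `retrieve` helper are all hypothetical and stand in for the output of real encoders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, candidates, top_k=2):
    """Rank candidates (name -> embedding) by similarity to the query."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical embeddings: images and texts live in the same vector space,
# so a single query can be compared against both kinds of candidate.
candidates = {
    "image:cat_photo": [0.9, 0.1, 0.0],
    "text:cat_article": [0.8, 0.2, 0.1],
    "text:stock_report": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]

results = retrieve(query, candidates)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

In this toy setup the image and the text about the same topic both rank above the unrelated text. Semantic misalignment, as described above, would correspond to related items landing far apart and therefore ranking low.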

Key facts

Paper arXiv:2604.21326 introduces MiMIC.
MiMIC addresses visual modality collapse in UMR.
UMR maps different modalities into a shared embedding space.
Marvel is an early-fusion method that projects visual features into LM space.
UniVL-DR is a late-fusion method using separate encoders.
Marvel exhibits visual modality collapse, ignoring visual features.
UniVL-DR is less affected by collapse but has semantic misalignment.
MiMIC aims to mitigate both collapse and misalignment.
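
One rough way to picture visual modality collapse is to check how much a fused embedding changes when only the image is swapped. The sketch below is an illustration under stated assumptions, not the paper's pilot study or the MiMIC method: `fuse` is a hypothetical weighted sum standing in for a real fusion model, and `visual_weight` is an invented knob that mimics how strongly the model attends to visual features.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def fuse(text_vec, image_vec, visual_weight):
    """Toy fusion (hypothetical): text features plus down-weighted image features."""
    return [t + visual_weight * v for t, v in zip(text_vec, image_vec)]

def image_sensitivity(text_vec, image_a, image_b, visual_weight):
    """Return 1 - cosine similarity between two embeddings that share the
    same text but use different images. Values near 0 mean the embedding
    barely reacts to the image."""
    emb_a = fuse(text_vec, image_a, visual_weight)
    emb_b = fuse(text_vec, image_b, visual_weight)
    return 1.0 - cosine(emb_a, emb_b)

text = [1.0, 0.0, 0.0]
img_a = [0.0, 1.0, 0.0]
img_b = [0.0, 0.0, 1.0]

# A model that nearly ignores the image versus one that uses it fully.
collapsed = image_sensitivity(text, img_a, img_b, visual_weight=0.01)
healthy = image_sensitivity(text, img_a, img_b, visual_weight=1.0)
print(f"collapsed: {collapsed:.6f}  healthy: {healthy:.6f}")
```

A sensitivity close to zero means two different images produce almost the same embedding, which is the behaviour the key facts attribute to Marvel.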
