MiMIC: A New Method to Fix Visual Modality Collapse in Multimodal Retrieval
A new paper on arXiv (2604.21326) introduces MiMIC, a method for addressing visual modality collapse in universal multimodal retrieval (UMR). UMR aims to map different modalities, such as visual and textual data, into a shared embedding space. Existing approaches include early-fusion methods such as Marvel, which projects visual features into the language model space, and late-fusion methods such as UniVL-DR, which uses separate encoders. The paper's pilot study found that Marvel suffers from visual modality collapse: it ignores visual features and relies too heavily on text. UniVL-DR is less affected by collapse but is prone to semantic misalignment, where related content ends up far apart in the embedding space. MiMIC is proposed to mitigate both issues.
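To make the idea of a shared embedding space concrete, here is a minimal toy sketch of retrieval by cosine similarity. It is not code from the paper: the three-dimensional vectors are made up, and `late_fusion_embed` (a hypothetical helper) combines the outputs of two separate encoders with a simple element-wise sum, which is an assumption chosen only for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def late_fusion_embed(text_vec, image_vec):
    # Assumption: outputs of separate text and image encoders are
    # combined by element-wise sum. Real systems may fuse differently.
    return [t + v for t, v in zip(text_vec, image_vec)]

def rank(query_vec, candidates):
    """Return candidate names sorted from most to least similar."""
    return sorted(
        candidates,
        key=lambda name: cosine(query_vec, candidates[name]),
        reverse=True,
    )

# Made-up vectors standing in for encoder outputs.
candidates = {
    "cat_photo_with_caption": late_fusion_embed([0.9, 0.1, 0.0], [0.8, 0.2, 0.0]),
    "dog_photo_with_caption": late_fusion_embed([0.1, 0.9, 0.0], [0.2, 0.8, 0.0]),
    "text_only_recipe": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]  # a query that points in the "cat" direction

ranking = rank(query, candidates)
print(ranking)
```

Because every candidate, whatever its modality, lives in the same vector space, a single similarity function is enough to rank text-only and image-text items together.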
Key facts
- Paper arXiv:2604.21326 introduces MiMIC.
- MiMIC addresses visual modality collapse in UMR.
- UMR maps different modalities into a shared embedding space.
- Marvel is an early-fusion method that projects visual features into LM space.
- UniVL-DR is a late-fusion method using separate encoders.
- Marvel exhibits visual modality collapse, ignoring visual features.
- UniVL-DR is less affected by collapse but has semantic misalignment.
- MiMIC aims to mitigate both collapse and misalignment.
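One rough way to picture visual modality collapse is to hold the text fixed, swap the image, and check how far the fused embedding moves. The sketch below is a hypothetical probe, not the paper's pilot study protocol: `fuse`, `visual_sensitivity`, and the `visual_weight` parameter are illustrative names, and the vectors are made up.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def fuse(text_vec, image_vec, visual_weight):
    # Toy fused encoder (assumption): text features plus scaled visual
    # features. A tiny visual_weight mimics a model that ignores images.
    return [t + visual_weight * v for t, v in zip(text_vec, image_vec)]

def visual_sensitivity(text_vec, image_a, image_b, visual_weight):
    """1 - cosine between embeddings that differ only in the image.

    Values near 0 mean swapping the image barely changes the embedding.
    """
    emb_a = fuse(text_vec, image_a, visual_weight)
    emb_b = fuse(text_vec, image_b, visual_weight)
    return 1.0 - cosine(emb_a, emb_b)

text = [1.0, 0.0, 0.0]
image_a = [0.0, 1.0, 0.0]
image_b = [0.0, 0.0, 1.0]

collapsed = visual_sensitivity(text, image_a, image_b, visual_weight=0.01)
balanced = visual_sensitivity(text, image_a, image_b, visual_weight=1.0)
print(f"collapsed encoder sensitivity: {collapsed:.6f}")
print(f"balanced encoder sensitivity:  {balanced:.6f}")
```

In this toy setup the near-zero visual weight gives a sensitivity close to 0, meaning two different images yield almost the same embedding, while the balanced encoder clearly separates them.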
Entities
Institutions
- arXiv