ARTFEED — Contemporary Art Intelligence

MiMIC: A New Method to Fix Visual Modality Collapse in Multimodal Retrieval

publication · 2026-04-25

A new paper on arXiv (2604.21326) introduces MiMIC, a method for addressing visual modality collapse in universal multimodal retrieval (UMR). UMR aims to map different modalities, such as visual and textual data, into a shared embedding space. Existing approaches include early-fusion methods such as Marvel, which projects visual features into the language model space, and late-fusion methods such as UniVL-DR, which uses separate encoders. The paper's pilot study found that Marvel suffers from visual modality collapse: it ignores visual features and relies too heavily on text. UniVL-DR is less affected by collapse but is prone to semantic misalignment, in which related content ends up far apart in the embedding space. MiMIC is proposed to mitigate both issues.
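To make the idea of visual modality collapse concrete, here is a minimal sketch of one way it could be probed. It is not the paper's pilot study or MiMIC itself. It assumes a hypothetical `embed(image, text)` function that returns a fused vector, and it measures how far that vector moves when the image is swapped while the text stays fixed. The names `visual_sensitivity`, `collapsed_embed`, and `balanced_embed`, along with the toy vectors, are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def visual_sensitivity(embed, samples):
    """Mean cosine distance between embed(image, text) and
    embed(other_image, text), where other_image is taken from the
    next sample. A value near 0 means the image barely affects the
    embedding, which is the symptom described as modality collapse."""
    total = 0.0
    for i, (image, text) in enumerate(samples):
        other_image = samples[(i + 1) % len(samples)][0]
        total += 1.0 - cosine(embed(image, text), embed(other_image, text))
    return total / len(samples)


# Hypothetical toy encoders, stand-ins for real models.
def collapsed_embed(image, text):
    # Ignores the image entirely: the extreme case of collapse.
    return list(text)


def balanced_embed(image, text):
    # Averages both inputs, so the image does influence the result.
    return [0.5 * a + 0.5 * b for a, b in zip(image, text)]


# Toy (image_vector, text_vector) pairs; real inputs would be model features.
samples = [
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),
    ([0.0, 0.0, 1.0], [0.0, 1.0, 0.0]),
    ([0.0, 1.0, 0.0], [1.0, 0.0, 0.0]),
]

print("collapsed:", visual_sensitivity(collapsed_embed, samples))
print("balanced: ", visual_sensitivity(balanced_embed, samples))
```

The collapsed encoder scores zero because swapping the image never changes its output, while the balanced encoder scores clearly above zero. A real check would substitute an actual retrieval model and real image-text pairs.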

Key facts

  • Paper arXiv:2604.21326 introduces MiMIC.
  • MiMIC addresses visual modality collapse in UMR.
  • UMR maps different modalities into a shared embedding space.
  • Marvel is an early-fusion method that projects visual features into the language model space.
  • UniVL-DR is a late-fusion method using separate encoders.
  • Marvel exhibits visual modality collapse, ignoring visual features.
  • UniVL-DR is less affected by collapse but is prone to semantic misalignment.
  • MiMIC aims to mitigate both collapse and misalignment.

Entities

Institutions

  • arXiv

Sources