ARTFEED — Contemporary Art Intelligence

Single Hub Text Exploits CLIP Cross-Modal Vulnerabilities

ai-technology · 2026-05-01

Researchers have identified a critical vulnerability in cross-modal encoders such as CLIP: a single "hub" text can achieve unreasonably high similarity scores against large numbers of unrelated images. This is an instance of the hubness problem, a known pathology of high-dimensional embedding spaces, and it poses practical threats to information retrieval and automatic evaluation metrics. The proposed method detects hub embeddings and their corresponding hub texts. Experiments on MSCOCO and nocaps for image-captioning evaluation, and on MSCOCO and Flickr30k for image-to-text retrieval, demonstrate that a single hub text can match or exceed the similarity scores of correct captions. This exposes a systemic weakness of cross-modal similarity computation, which must rely on a shared embedding space because direct comparisons such as string matching are impossible across modalities.
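The hubness effect described above can be illustrated with a toy simulation rather than the paper's actual method: in a high-dimensional shared embedding space, a vector pointing toward the centroid of the data tends to score well against nearly everything. The sketch below (a hypothetical illustration using random unit vectors as stand-ins for CLIP image embeddings, not the authors' detection procedure) compares the average cosine similarity of such a "hub-like" direction against that of an ordinary random embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_images = 512, 1000  # assumed dimensions, roughly CLIP-sized

# Simulated image embeddings on the unit hypersphere
# (stand-ins for real CLIP image features).
images = rng.normal(size=(n_images, dim))
images /= np.linalg.norm(images, axis=1, keepdims=True)

# A hub-like direction: the normalized mean of the image embeddings.
# In high dimensions, points near the data centroid appear in many
# nearest-neighbor lists -- the hubness phenomenon.
hub = images.mean(axis=0)
hub /= np.linalg.norm(hub)

# An ordinary random "text" embedding for comparison.
rand = rng.normal(size=dim)
rand /= np.linalg.norm(rand)

# All vectors are unit-norm, so dot products are cosine similarities.
hub_sims = images @ hub
rand_sims = images @ rand

print(f"hub mean similarity:    {hub_sims.mean():.4f}")
print(f"random mean similarity: {rand_sims.mean():.4f}")
```

The hub direction's average similarity is systematically higher than a random embedding's, even though it is tied to no single image. A text whose embedding lands near such a direction would score plausibly against many unrelated images, which is the failure mode the study exploits.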

Key facts

  • Hubness problem occurs in high-dimensional embedding spaces
  • Cross-modal encoders project text and images into a shared space
  • Proposed method identifies hub embeddings and hub texts
  • Experiments conducted on MSCOCO, nocaps, and Flickr30k datasets
  • Single hub text achieves comparable or higher similarity scores than correct captions
  • Vulnerability affects information retrieval and automatic evaluation metrics
  • Cross-modal similarity cannot use direct string matching
  • Study published on arXiv with ID 2604.27674

Entities

Institutions

  • arXiv
