ARTFEED — Contemporary Art Intelligence

Single Hub Text Exploits CLIP Cross-Modal Vulnerabilities

ai-technology · 2026-05-01

Researchers have identified a critical vulnerability in cross-modal encoders such as CLIP: a single "hub" text can achieve unreasonably high similarity scores against large numbers of unrelated images. This is an instance of the hubness problem, a known pathology of high-dimensional embedding spaces, and it poses practical threats to information retrieval and automatic evaluation metrics. The proposed method detects hub embeddings and their corresponding hub texts. Experiments on MSCOCO and nocaps for image-captioning evaluation, and on MSCOCO and Flickr30k for image-to-text retrieval, demonstrate that a single hub text can match or exceed the similarity scores of correct captions. This exposes a systemic weakness of cross-modal similarity computation, which must rely on a shared embedding space because direct comparisons such as string matching are impossible across modalities.
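The hubness effect described above can be illustrated with a toy simulation rather than the paper's actual method: in a high-dimensional shared embedding space, a vector pointing toward the centroid of the data tends to score well against nearly everything. The sketch below (a hypothetical illustration using random unit vectors as stand-ins for CLIP image embeddings, not the authors' detection procedure) compares the average cosine similarity of such a "hub-like" direction against that of an ordinary random embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_images = 512, 1000  # assumed dimensions, roughly CLIP-sized

# Simulated image embeddings on the unit hypersphere
# (stand-ins for real CLIP image features).
images = rng.normal(size=(n_images, dim))
images /= np.linalg.norm(images, axis=1, keepdims=True)

# A hub-like direction: the normalized mean of the image embeddings.
# In high dimensions, points near the data centroid appear in many
# nearest-neighbor lists -- the hubness phenomenon.
hub = images.mean(axis=0)
hub /= np.linalg.norm(hub)

# An ordinary random "text" embedding for comparison.
rand = rng.normal(size=dim)
rand /= np.linalg.norm(rand)

# All vectors are unit-norm, so dot products are cosine similarities.
hub_sims = images @ hub
rand_sims = images @ rand

print(f"hub mean similarity:    {hub_sims.mean():.4f}")
print(f"random mean similarity: {rand_sims.mean():.4f}")
```

The hub direction's average similarity is systematically higher than a random embedding's, even though it is tied to no single image. A text whose embedding lands near such a direction would score plausibly against many unrelated images, which is the failure mode the study exploits.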

Key facts

  • Hubness problem occurs in high-dimensional embedding spaces
  • Cross-modal encoders project text and images into a shared space
  • Proposed method identifies hub embeddings and hub texts
  • Experiments conducted on MSCOCO, nocaps, and Flickr30k datasets
  • Single hub text achieves comparable or higher similarity scores than correct captions
  • Vulnerability affects information retrieval and automatic evaluation metrics
  • Cross-modal similarity cannot use direct string matching
  • Study published on arXiv with ID 2604.27674

Entities

Institutions

  • arXiv
