ARTFEED — Contemporary Art Intelligence

Text-Only Queries Can Reveal PII Memorization in Multimodal Models

ai-technology · 2026-04-24

A recent study shows that multimodal contrastive pre-training models such as CLIP and CLAP can memorize Personally Identifiable Information (PII) from their web-scale training data. Notably, this memorization can be audited through text queries alone, with no biometric data required. The researchers introduce the Unimodal Membership Inference Detector (UMID), a framework that uses text-based membership inference attacks (MIAs) to determine whether specific PII has been memorized. The approach sidesteps the computational cost of shadow-model MIAs on large multimodal backbones and avoids exposing sensitive biometric data to the target model. The findings underscore the privacy risks of the foundation encoders used in large multimodal models.
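For illustration, here is a minimal sketch of what a text-only memorization probe against a contrastive text encoder could look like. This is not the UMID method itself, whose scoring rule is not detailed in this summary; the checkpoint name, the caption templates, and the cluster-tightness heuristic are all assumptions made for the example.

    import torch
    from transformers import CLIPModel, CLIPTokenizer

    # Hypothetical checkpoint choice; any CLIP-style text encoder would do.
    CKPT = "openai/clip-vit-base-patch32"
    model = CLIPModel.from_pretrained(CKPT).eval()
    tokenizer = CLIPTokenizer.from_pretrained(CKPT)

    # Caption templates mentioning the candidate identity (assumed for
    # this sketch, not taken from the paper).
    TEMPLATES = [
        "a photo of {name}",
        "a portrait of {name}",
        "{name} at a public event",
        "a close-up of the face of {name}",
    ]

    @torch.no_grad()
    def text_consistency_score(name: str) -> float:
        """Mean pairwise cosine similarity between text embeddings of
        captions mentioning `name`. The heuristic (an assumption): names
        seen during contrastive pre-training tend to form tighter
        caption-embedding clusters than unseen names."""
        prompts = [t.format(name=name) for t in TEMPLATES]
        inputs = tokenizer(prompts, padding=True, return_tensors="pt")
        emb = model.get_text_features(**inputs)        # (n_templates, dim)
        emb = emb / emb.norm(dim=-1, keepdim=True)     # unit-normalize
        sims = emb @ emb.T                             # cosine similarities
        mask = ~torch.eye(len(prompts), dtype=torch.bool)
        return sims[mask].mean().item()                # drop self-similarity

    # Calibrate against implausible names, then flag candidates whose
    # score clearly exceeds the reference range.
    reference = [text_consistency_score(n) for n in ("Xaqil Vorn", "Tibrel Oswane")]
    candidate = text_consistency_score("Jane Q. Example")
    print(f"reference: {reference}  candidate: {candidate:.4f}")

Consistent with UMID's premise, a probe like this issues only text queries: no image, audio, or other biometric input is ever submitted to the target model.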

Key facts

  • Contrastive pre-training models like CLIP and CLAP memorize PII from web-scale data.
  • Existing multimodal auditing methods require paired biometric inputs, exposing sensitive data.
  • UMID uses only text queries to infer multimodal memorization.
  • Shadow-model MIAs are computationally prohibitive for large multimodal backbones.
  • The study was published on arXiv under identifier 2603.14222.
  • The arXiv entry is a replaced (revised) and cross-listed announcement.
  • UMID stands for Unimodal Membership Inference Detector.
  • The research addresses privacy auditing for foundational encoders.

Entities

Institutions

  • arXiv
