ARTFEED — Contemporary Art Intelligence

VLM latent spaces contain 164 dimensions of non-semantic noise

publication · 2026-05-16

A new arXiv preprint (arXiv:2605.14893) reports that contrastively pretrained Vision-Language Models (VLMs) such as CLIP harbor substantial non-semantic noise in their shared latent spaces (164 dimensions in the case of CLIP). The authors applied spectral decomposition to embedding covariance matrices, separating multi-modal semantic signal from a shared noise subspace, and found that this noise geometry is strongly invariant across diverse data subsets. Pruning the noise dimensions is largely harmless and can even improve downstream task performance. The study suggests that a substantial fraction of VLM latent geometry is governed by architecture-level noise rather than task-relevant semantics, offering new mechanistic insight into representational structure.
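The preprint's exact criterion for isolating the noise subspace is not given in this summary. The sketch below (Python with NumPy) is a minimal illustration under assumed conditions: it eigendecomposes the covariance of pooled image/text embeddings and, as a hypothetical rule, treats the trailing low-variance directions as the noise subspace. The function names (split_semantic_and_noise, prune_noise), the selection rule, and the default of 164 dimensions (the paper's reported figure for CLIP) are illustrative assumptions, not the authors' method.

    import numpy as np

    def split_semantic_and_noise(embeddings: np.ndarray, n_noise: int = 164):
        """Eigendecompose the covariance of pooled embeddings and treat the
        trailing low-variance directions as a candidate noise subspace.
        (164 is simply the figure the preprint reports for CLIP.)"""
        # Center the embeddings and form the (d x d) covariance matrix.
        centered = embeddings - embeddings.mean(axis=0, keepdims=True)
        cov = centered.T @ centered / (len(centered) - 1)

        # Spectral decomposition; eigh returns eigenvalues in ascending order.
        eigvals, eigvecs = np.linalg.eigh(cov)
        order = np.argsort(eigvals)[::-1]     # sort directions by descending variance
        eigvecs = eigvecs[:, order]

        # Leading directions form the semantic basis; the last n_noise form the noise basis.
        semantic_basis = eigvecs[:, :-n_noise]
        noise_basis = eigvecs[:, -n_noise:]
        return semantic_basis, noise_basis

    def prune_noise(embeddings: np.ndarray, semantic_basis: np.ndarray) -> np.ndarray:
        """Project embeddings onto the semantic subspace, discarding noise dimensions."""
        centered = embeddings - embeddings.mean(axis=0, keepdims=True)
        return centered @ semantic_basis

In this toy setup, "pruning" is just a projection onto the retained directions; the paper's observation is that such a projection leaves downstream performance intact or improved.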

Key facts

  • arXiv:2605.14893
  • Contrastively pretrained VLMs carry non-semantic noise in their shared latent spaces
  • Spectral decomposition of covariance matrices used
  • Noise geometry shows strong subgroup invariance (see the subspace-overlap sketch after this list)
  • Pruning noise dimensions preserves or improves performance
  • Noise is architecture-level, not task-relevant semantics
  • 164 dimensions of noise identified in CLIP
  • New mechanistic insights into VLM representations
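The subgroup-invariance claim could be probed by estimating the noise subspace separately on different data subsets and measuring how closely the subspaces agree. A minimal sketch, assuming orthonormal bases as produced by the decomposition above and using principal angles as the overlap measure; subspace_overlap and the subset names are hypothetical, not the paper's evaluation protocol.

    import numpy as np

    def subspace_overlap(basis_a: np.ndarray, basis_b: np.ndarray) -> float:
        """Mean squared cosine of the principal angles between two orthonormal
        bases: 1.0 means identical subspaces, 0.0 means orthogonal ones."""
        # Singular values of A^T B are the cosines of the principal angles.
        cosines = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
        return float(np.mean(cosines ** 2))

    # Hypothetical usage: estimate the noise basis on two data subsets
    # (e.g. different image domains) and compare the resulting subspaces.
    # _, noise_a = split_semantic_and_noise(embeddings_subset_a)
    # _, noise_b = split_semantic_and_noise(embeddings_subset_b)
    # print(subspace_overlap(noise_a, noise_b))  # values near 1.0 suggest invariance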

Entities

Institutions

  • arXiv

Sources