ARTFEED — Contemporary Art Intelligence

VLM latent spaces contain 164 dimensions of non-semantic noise

publication · 2026-05-16

A new arXiv preprint (arXiv:2605.14893) reports that contrastively pretrained Vision-Language Models (VLMs) such as CLIP harbor substantial non-semantic noise in their shared latent spaces (164 dimensions in the case of CLIP). The authors applied spectral decomposition to embedding covariance matrices, separating multi-modal semantic signal from a shared noise subspace, and found that this noise geometry is strongly invariant across diverse data subsets. Pruning the noise dimensions is largely harmless and can even improve downstream task performance. The study suggests that a substantial fraction of VLM latent geometry is governed by architecture-level noise rather than task-relevant semantics, offering new mechanistic insight into representational structure.
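The preprint's exact criterion for isolating the noise subspace is not given in this summary. The sketch below (Python with NumPy) is a minimal illustration under assumed conditions: it eigendecomposes the covariance of pooled image/text embeddings and, as a hypothetical rule, treats the trailing low-variance directions as the noise subspace. The function names (split_semantic_and_noise, prune_noise), the selection rule, and the default of 164 dimensions (the paper's reported figure for CLIP) are illustrative assumptions, not the authors' method.

    import numpy as np

    def split_semantic_and_noise(embeddings: np.ndarray, n_noise: int = 164):
        """Eigendecompose the covariance of pooled embeddings and treat the
        trailing low-variance directions as a candidate noise subspace.
        (164 is simply the figure the preprint reports for CLIP.)"""
        # Center the embeddings and form the (d x d) covariance matrix.
        centered = embeddings - embeddings.mean(axis=0, keepdims=True)
        cov = centered.T @ centered / (len(centered) - 1)

        # Spectral decomposition; eigh returns eigenvalues in ascending order.
        eigvals, eigvecs = np.linalg.eigh(cov)
        order = np.argsort(eigvals)[::-1]     # sort directions by descending variance
        eigvecs = eigvecs[:, order]

        # Leading directions form the semantic basis; the last n_noise form the noise basis.
        semantic_basis = eigvecs[:, :-n_noise]
        noise_basis = eigvecs[:, -n_noise:]
        return semantic_basis, noise_basis

    def prune_noise(embeddings: np.ndarray, semantic_basis: np.ndarray) -> np.ndarray:
        """Project embeddings onto the semantic subspace, discarding noise dimensions."""
        centered = embeddings - embeddings.mean(axis=0, keepdims=True)
        return centered @ semantic_basis

In this toy setup, "pruning" is just a projection onto the retained directions; the paper's observation is that such a projection leaves downstream performance intact or improved.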

Key facts

  • arXiv:2605.14893
  • Contrastively pretrained VLMs carry non-semantic noise in their shared latent spaces
  • Spectral decomposition of covariance matrices used
  • Noise geometry shows strong subgroup invariance (see the subspace-overlap sketch after this list)
  • Pruning noise dimensions preserves or improves performance
  • Noise is architecture-level, not task-relevant semantics
  • 164 dimensions of noise identified in CLIP
  • New mechanistic insights into VLM representations
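The subgroup-invariance claim could be probed by estimating the noise subspace separately on different data subsets and measuring how closely the subspaces agree. A minimal sketch, assuming orthonormal bases as produced by the decomposition above and using principal angles as the overlap measure; subspace_overlap and the subset names are hypothetical, not the paper's evaluation protocol.

    import numpy as np

    def subspace_overlap(basis_a: np.ndarray, basis_b: np.ndarray) -> float:
        """Mean squared cosine of the principal angles between two orthonormal
        bases: 1.0 means identical subspaces, 0.0 means orthogonal ones."""
        # Singular values of A^T B are the cosines of the principal angles.
        cosines = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
        return float(np.mean(cosines ** 2))

    # Hypothetical usage: estimate the noise basis on two data subsets
    # (e.g. different image domains) and compare the resulting subspaces.
    # _, noise_a = split_semantic_and_noise(embeddings_subset_a)
    # _, noise_b = split_semantic_and_noise(embeddings_subset_b)
    # print(subspace_overlap(noise_a, noise_b))  # values near 1.0 suggest invariance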

Entities

Institutions

  • arXiv

Sources