Modality Gap in OOD Detection with Vision-Language Models

ai-technology · 2026-05-27

A new paper on arXiv (2605.26661) challenges the common practice of using text embeddings as class prototypes for zero-shot out-of-distribution (OOD) detection in pre-trained vision-language models (VLMs). The authors theoretically demonstrate that off-the-shelf textual prototypes are misaligned with optimal visual prototypes, creating an intrinsic modality gap that prompt engineering alone cannot fix. To address this under post-hoc constraints, they propose an online pseudo-supervised framework that learns class prototypes directly in the visual feature space from unlabeled test-time data streams.
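
For context, the text-as-prototype baseline the paper questions can be sketched roughly as follows. This is a minimal illustration, not code from the paper. It assumes image and text features already come from a shared VLM embedding space, and the function name `text_prototype_score` and the default temperature are hypothetical. The sample is scored by the maximum softmax probability over its cosine similarities to the class text embeddings, which is one common form of this baseline; a low score suggests an OOD input.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def text_prototype_score(image_feat, text_protos, temperature=1.0):
    """Max softmax probability over cosine similarities to text prototypes.

    Higher values mean the image matches one class clearly (likely
    in-distribution); values near 1 / num_classes mean no class stands out.
    """
    sims = [cosine(image_feat, p) / temperature for p in text_protos]
    m = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in sims]
    return max(exps) / sum(exps)


# Toy 2-D "embeddings" standing in for real VLM features (assumption).
text_protos = [[1.0, 0.0], [0.0, 1.0]]
score_in = text_prototype_score([1.0, 0.0], text_protos)   # aligned with class 0
score_ood = text_prototype_score([1.0, 1.0], text_protos)  # equally far from both
```

In this toy setup the aligned feature scores higher than the ambiguous one. The paper's argument is that with real VLMs the text prototypes themselves sit away from where the visual features cluster, so this score is computed against misplaced reference points.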

Key facts

arXiv paper 2605.26661
Title: Respecting Modality Gap in Post-hoc Out-of-distribution Detection with Pre-trained Vision-Language Models
Challenges text-as-prototype paradigm for zero-shot OOD detection
Shows theoretical misalignment between textual and visual prototypes
Proposes online pseudo-supervised framework to learn visual prototypes from test-time data
Method operates under a post-hoc constraint, without access to training data
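
The idea of learning visual prototypes online from pseudo-labeled test data can be illustrated with a small sketch. Everything here is an assumption made for illustration and is not the paper's algorithm: the class name `OnlineVisualPrototypes`, the initialization from text embeddings, the confidence threshold, and the exponential-moving-average update are all hypothetical choices. The sketch only shows the general pattern: pseudo-label each incoming test feature, and let confident samples pull the matching prototype toward the visual feature space.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class OnlineVisualPrototypes:
    """Hypothetical prototype bank updated from an unlabeled test stream."""

    def __init__(self, init_protos, momentum=0.9,
                 accept_threshold=0.8, temperature=0.1):
        # Assumption: prototypes start from text embeddings, then drift
        # toward visual features as confident test samples arrive.
        self.protos = [l2_normalize(p) for p in init_protos]
        self.momentum = momentum
        self.accept_threshold = accept_threshold
        self.temperature = temperature

    def score(self, feat):
        """Return (pseudo-label, max softmax confidence) for one feature."""
        f = l2_normalize(feat)
        sims = [dot(f, p) / self.temperature for p in self.protos]
        m = max(sims)
        exps = [math.exp(s - m) for s in sims]
        total = sum(exps)
        probs = [e / total for e in exps]
        k = max(range(len(probs)), key=probs.__getitem__)
        return k, probs[k]

    def update(self, feat):
        """Pseudo-label one test feature; update only if confident."""
        k, conf = self.score(feat)
        if conf >= self.accept_threshold:
            f = l2_normalize(feat)
            mixed = [self.momentum * p + (1.0 - self.momentum) * x
                     for p, x in zip(self.protos[k], f)]
            self.protos[k] = l2_normalize(mixed)
        return k, conf


# Toy stream: class-0 images cluster at an angle to the class-0 text prototype.
bank = OnlineVisualPrototypes([[1.0, 0.0], [0.0, 1.0]])
visual_feat = [0.8, 0.6]
sim_before = dot(bank.protos[0], l2_normalize(visual_feat))
for _ in range(20):
    bank.update(visual_feat)
sim_after = dot(bank.protos[0], l2_normalize(visual_feat))
```

Because no labels or training data are used, the update relies entirely on the pseudo-labels, so the acceptance threshold matters: low-confidence samples, which may be OOD, are skipped and leave the prototypes unchanged.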
