Modality Gap in OOD Detection with Vision-Language Models
A new paper on arXiv (2605.26661) challenges the common practice of using text embeddings as class prototypes for zero-shot out-of-distribution (OOD) detection in pre-trained vision-language models (VLMs). The authors theoretically demonstrate that off-the-shelf textual prototypes are misaligned with optimal visual prototypes, creating an intrinsic modality gap that prompt engineering alone cannot fix. To address this under post-hoc constraints, they propose an online pseudo-supervised framework that learns class prototypes directly in the visual feature space from unlabeled test-time data streams.
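For context, the text-as-prototype practice the paper questions can be sketched roughly as follows. This is a generic illustration of a common zero-shot scoring pattern (maximum softmax probability over cosine similarities to class text embeddings), not the paper's own code. The function names, the temperature value, and the three-dimensional toy vectors are all assumptions made for the example; real embeddings would come from a VLM's text and image encoders.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def max_softmax_score(image_emb, prototypes, temperature=0.1):
    """Higher score = more likely in-distribution (hypothetical helper)."""
    logits = [cosine(image_emb, p) / temperature for p in prototypes]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    return max(exps) / sum(exps)


# Toy stand-ins for class text embeddings used as prototypes (assumed values).
text_prototypes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

in_dist_image = [0.9, 0.1, 0.0]  # close to class 0
ood_image = [0.5, 0.5, 0.7]      # equally far from both classes

id_score = max_softmax_score(in_dist_image, text_prototypes)
ood_score = max_softmax_score(ood_image, text_prototypes)
print(id_score > ood_score)
```

In this toy setup, an input is flagged as OOD when its score falls below a chosen threshold. The paper's argument is that the prototypes in such a scheme are better learned in the visual feature space than taken directly from text embeddings.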
Key facts
- arXiv ID: 2605.26661
- Title: Respecting Modality Gap in Post-hoc Out-of-distribution Detection with Pre-trained Vision-Language Models
- Challenges text-as-prototype paradigm for zero-shot OOD detection
- Shows theoretical misalignment between textual and visual prototypes
- Proposes online pseudo-supervised framework to learn visual prototypes from test-time data
- Method operates under a post-hoc constraint, without access to training data
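The idea of learning visual prototypes online from pseudo-labeled test data could look roughly like the sketch below. It is a simplified illustration under stated assumptions, not the paper's algorithm: the summary above does not specify the update rule, loss, or thresholds, so the confidence threshold, the exponential-moving-average update, the initialization from text prototypes, and all names here are hypothetical. The toy features carry a shared third component to mimic an offset between visual features and text prototypes.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def softmax_probs(feature, prototypes, temperature):
    """Softmax over temperature-scaled cosine similarities to each prototype."""
    f = normalize(feature)
    logits = [
        sum(a * b for a, b in zip(f, normalize(p))) / temperature
        for p in prototypes
    ]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def online_update(prototypes, stream, threshold=0.9, momentum=0.9,
                  temperature=0.1):
    """Assumed EMA rule: move a class prototype toward confident test features."""
    protos = [normalize(p) for p in prototypes]
    for feature in stream:
        probs = softmax_probs(feature, protos, temperature)
        k = max(range(len(probs)), key=probs.__getitem__)  # pseudo-label
        if probs[k] < threshold:
            continue  # low confidence: possibly OOD, so do not learn from it
        f = normalize(feature)
        blended = [momentum * p + (1 - momentum) * x
                   for p, x in zip(protos[k], f)]
        protos[k] = normalize(blended)
    return protos


# Prototypes start from toy text embeddings (an assumption of this sketch).
initial = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Unlabeled test-time stream; the second item is ambiguous and gets skipped.
stream = [[0.8, 0.0, 0.6], [0.5, 0.5, 0.7], [0.0, 0.8, 0.6], [0.8, 0.1, 0.6]]

updated = online_update(initial, stream)
print(updated)
```

No training data or model weights are touched here; only the prototypes change as test features arrive, which is one way to read the post-hoc constraint. After the stream is processed, both prototypes have picked up a positive third component, that is, they have drifted from their text-derived starting points toward the region the visual features occupy.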
Entities
Institutions
- arXiv