ARTFEED — Contemporary Art Intelligence

Vision-Language Models Recover Hypernym Knowledge from Language Alone

publication · 2026-04-24

A new arXiv preprint investigates how vision-language models (VLMs) generalize hypernym knowledge (e.g., that a "beagle" is also a "dog" and an "animal") when visual evidence is limited. The researchers, whose institution is not named in the summary, froze both the image encoder and the language model (LM) and trained only the intermediate mappings between them. They then progressively deprived the VLM of explicit hypernym evidence during training to test whether the LM could recover this knowledge on its own. Results show that the LM generalizes hypernyms even in the most extreme case, with no hypernym evidence seen during training. The study probes the interplay between semantic representations learned from surface form and those learned from grounded evidence, using the task of predicting hypernyms of objects shown in images. Additional experiments are reported to suggest further generalization capabilities, though the summary gives no details. The paper is available on arXiv under ID 2603.07474.
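The frozen-backbone setup described above can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' actual method: a stand-in "image encoder" and "LM embedding table" are fixed random matrices, and only the intermediate linear mapping receives gradient updates. All dimensions, the toy data, and the squared-error objective are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_IMG, D_LM = 32, 16, 8  # input, image-feature, and LM-embedding dims

# Frozen components (stand-ins for the pretrained encoder and LM).
W_enc = rng.normal(size=(D_IMG, D_IN)) / np.sqrt(D_IN)  # frozen image encoder
lm_embed = rng.normal(size=(4, D_LM))                   # frozen LM embeddings
                                                        # (4 toy "hypernyms")

# Trainable intermediate mapping: the ONLY parameters that get updated.
W_map = np.zeros((D_LM, D_IMG))

# Toy "dataset": random images, each paired with a hypernym label 0..3.
xs = rng.normal(size=(64, D_IN))
ys = rng.integers(0, 4, size=64)

def features(x_batch):
    """Frozen encoder forward pass: linear map + ReLU."""
    return np.maximum(x_batch @ W_enc.T, 0.0)

def loss():
    """Mean squared error between mapped features and target embeddings."""
    preds = features(xs) @ W_map.T
    return float(np.mean((preds - lm_embed[ys]) ** 2))

initial_loss = loss()
W_enc_before = W_enc.copy()
for _ in range(200):
    feats = features(xs)
    preds = feats @ W_map.T
    # Gradient flows only into W_map; encoder and LM embeddings stay frozen.
    grad = 2.0 * (preds - lm_embed[ys]).T @ feats / len(xs)
    W_map -= 0.05 * grad
final_loss = loss()

assert final_loss < initial_loss            # the mapping alone learned the task
assert np.array_equal(W_enc, W_enc_before)  # frozen weights are untouched
```

The point of the sketch is the division of labor: the mapping can only re-route information already present in the frozen representations, which is what makes the paper's question (can the LM supply hypernym knowledge the training data withholds?) well posed.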

Key facts

  • The study examines cross-modal taxonomic generalization in vision-language models.
  • Both the image encoder and the language model were frozen; only the intermediate mappings were trained.
  • VLMs were progressively deprived of explicit hypernym evidence during training.
  • LMs recovered hypernym knowledge even without any hypernym evidence during training.
  • Focus on predicting hypernyms of objects represented in images.
  • Paper available on arXiv with ID 2603.07474.
  • Additional experiments suggest further generalization capabilities, though details are not given in the summary.
  • The study examines the interplay between semantic representations learned from surface form and from grounded evidence.

Entities

Institutions

  • arXiv
