ARTFEED — Contemporary Art Intelligence

Benchmarking Video-Text Retrieval Under Query Shifts

ai-technology · 2026-04-25

A recent study published on arXiv introduces a benchmark for evaluating video-text retrieval (VTR) models under real-world query shifts, where the distribution of queries at test time diverges from the training distribution. The benchmark comprises 12 types of video perturbation, each applied at five severity levels. The findings show that query shifts intensify the hubness problem, in which a small number of gallery items become the nearest neighbours of, and thus attract, the majority of queries. To counter this, the researchers propose HAT-VTR (Hubness Alleviation for Test-time Video-Text Retrieval), a test-time adaptation framework that directly mitigates hubness. The work highlights the susceptibility of existing VTR models to distribution shift and lays the groundwork for more resilient retrieval systems.
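The hubness effect described above can be made concrete by counting how often each gallery item lands in the top-k results across all queries. The sketch below is not the paper's code; it is a minimal, self-contained illustration using random embeddings and cosine similarity, with all names (`k_occurrence`, the array shapes) chosen here for the example.

```python
# Minimal sketch: measuring "hubness" in a retrieval gallery.
# A heavy right tail in the k-occurrence counts (a few items retrieved
# for very many queries) is the symptom the article describes.
import numpy as np

def k_occurrence(queries: np.ndarray, gallery: np.ndarray, k: int = 10) -> np.ndarray:
    """Count how often each gallery item appears among the top-k
    cosine-similarity neighbours, aggregated over all queries."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = q @ g.T                            # (n_queries, n_gallery)
    topk = np.argsort(-sims, axis=1)[:, :k]   # indices of k best matches per query
    return np.bincount(topk.ravel(), minlength=len(gallery))

rng = np.random.default_rng(0)
gallery = rng.normal(size=(1000, 64))   # stand-in for video embeddings
queries = rng.normal(size=(500, 64))    # stand-in for text-query embeddings
counts = k_occurrence(queries, gallery)
# Under a uniform world each item would be retrieved ~5 times on average;
# items with counts far above that are "hubs". A query shift typically
# skews this distribution further.
print(counts.max(), counts.mean())
```

Even on isotropic random data in high dimensions, the count distribution is visibly skewed; shifted or perturbed queries tend to concentrate it further, which is the failure mode HAT-VTR targets at test time.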

Key facts

  • arXiv paper 2604.20851
  • 12 distinct types of video perturbations
  • Five severity degrees
  • Query shifts amplify hubness phenomenon
  • HAT-VTR proposed as baseline test-time adaptation framework
  • Existing image-focused solutions inadequate for video
  • Complex spatio-temporal dynamics in video shifts
  • Sharp performance drop under query shifts

Entities

Institutions

  • arXiv

Sources