ARTFEED — Contemporary Art Intelligence

Benchmarking Video-Text Retrieval Under Query Shifts

ai-technology · 2026-04-25

A recent study published on arXiv introduces a benchmark for evaluating video-text retrieval (VTR) models under real-world query shifts, where the distribution of queries at test time diverges from the training distribution. The benchmark comprises 12 types of video perturbation, each applied at five severity levels. The findings show that query shifts intensify the hubness problem, in which a small number of gallery items become the nearest neighbours of, and thus attract, the majority of queries. To counter this, the researchers propose HAT-VTR (Hubness Alleviation for Test-time Video-Text Retrieval), a test-time adaptation framework that directly mitigates hubness. The work highlights the susceptibility of existing VTR models to distribution shift and lays the groundwork for more resilient retrieval systems.
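The hubness effect described above can be made concrete by counting how often each gallery item lands in the top-k results across all queries. The sketch below is not the paper's code; it is a minimal, self-contained illustration using random embeddings and cosine similarity, with all names (`k_occurrence`, the array shapes) chosen here for the example.

```python
# Minimal sketch: measuring "hubness" in a retrieval gallery.
# A heavy right tail in the k-occurrence counts (a few items retrieved
# for very many queries) is the symptom the article describes.
import numpy as np

def k_occurrence(queries: np.ndarray, gallery: np.ndarray, k: int = 10) -> np.ndarray:
    """Count how often each gallery item appears among the top-k
    cosine-similarity neighbours, aggregated over all queries."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = q @ g.T                            # (n_queries, n_gallery)
    topk = np.argsort(-sims, axis=1)[:, :k]   # indices of k best matches per query
    return np.bincount(topk.ravel(), minlength=len(gallery))

rng = np.random.default_rng(0)
gallery = rng.normal(size=(1000, 64))   # stand-in for video embeddings
queries = rng.normal(size=(500, 64))    # stand-in for text-query embeddings
counts = k_occurrence(queries, gallery)
# Under a uniform world each item would be retrieved ~5 times on average;
# items with counts far above that are "hubs". A query shift typically
# skews this distribution further.
print(counts.max(), counts.mean())
```

Even on isotropic random data in high dimensions, the count distribution is visibly skewed; shifted or perturbed queries tend to concentrate it further, which is the failure mode HAT-VTR targets at test time.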

Key facts

  • arXiv paper 2604.20851
  • 12 distinct types of video perturbations
  • Five severity degrees
  • Query shifts amplify hubness phenomenon
  • HAT-VTR proposed as baseline test-time adaptation framework
  • Existing image-focused solutions inadequate for video
  • Complex spatio-temporal dynamics in video shifts
  • Sharp performance drop under query shifts

Entities

Institutions

  • arXiv

Sources