ARTFEED — Contemporary Art Intelligence

USV Dataset for User-Generated Short-Form Video Understanding

digital · 2026-05-22

The newly released USV (User-generated Short-form Video) dataset targets high-level semantic understanding of short-form videos. It contains around 224,000 videos collected from user-generated content (UGC) platforms through label queries, with no manual verification or trimming. Two tasks are defined on it: topic recognition and video-text retrieval. For topic recognition, the authors introduce baseline methods including Multi-Modality Fusion Network (MMF-Net) and Video-Text Contrastive Learning (VTCL). The study argues that existing video understanding work focuses mainly on instance-level recognition, which is not sufficient for high-level semantic understanding.
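To make the video-text retrieval task concrete, here is a minimal sketch of how such a task is commonly scored: videos are ranked by cosine similarity to a text query, and Recall@K counts how often the matching video lands in the top K. This is a generic illustration and not the USV authors' method. The embeddings, clip names, and helper functions (`cosine`, `rank_videos`, `recall_at_k`) are all hypothetical toy values chosen for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_videos(text_vec, video_vecs):
    """Return video ids sorted from most to least similar to the text query."""
    scores = {vid: cosine(text_vec, vec) for vid, vec in video_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


def recall_at_k(queries, video_vecs, k):
    """Fraction of queries whose target video appears in the top-k results."""
    hits = sum(
        1
        for text_vec, target in queries
        if target in rank_videos(text_vec, video_vecs)[:k]
    )
    return hits / len(queries)


# Hypothetical toy embeddings; a real system would get these from trained encoders.
VIDEO_VECS = {
    "cooking_clip": [0.9, 0.1, 0.0],
    "dance_clip": [0.1, 0.9, 0.1],
    "pet_clip": [0.0, 0.2, 0.9],
}

# Each query pairs a (toy) text embedding with the id of its matching video.
QUERIES = [
    ([1.0, 0.0, 0.1], "cooking_clip"),
    ([0.0, 1.0, 0.0], "dance_clip"),
    ([0.1, 0.6, 0.5], "pet_clip"),
]

if __name__ == "__main__":
    for k in (1, 2):
        print(f"Recall@{k}: {recall_at_k(QUERIES, VIDEO_VECS, k):.2f}")
```

In this toy setup the third query is ambiguous between the dance and pet clips, so its target is only found at rank 2. That is the kind of miss Recall@1 penalizes and Recall@2 forgives.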

Key facts

  • USV dataset contains around 224K videos from UGC platforms.
  • Videos collected by label queries without manual verification or trimming.
  • Two tasks: topic recognition and video-text retrieval.
  • Baseline methods: MMF-Net and VTCL for topic recognition.
  • Existing video understanding focuses on instance-level recognition.
  • High-level semantic video understanding is the goal.
  • Paper describing the dataset is available on arXiv under ID 2605.20838.
  • arXiv announce type: cross (cross-listed submission).

Entities

Institutions

  • arXiv

Sources