USV Dataset for User-Generated Short-Form Video Understanding

digital · 2026-05-22

The USV (User-generated Short-form Video) dataset is intended to advance the understanding of short-form video at a high semantic level. It contains around 224,000 videos collected from user-generated content (UGC) platforms through label queries, with no manual verification or trimming. Two tasks are defined on it: topic recognition and video-text retrieval. For the topic recognition task, baseline methods such as Multi-Modality Fusion Network (MMF-Net) and Video-Text Contrastive Learning (VTCL) are introduced. The study emphasizes that current approaches to video understanding focus mainly on instance-level recognition, which falls short of capturing high-level semantics.
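The summary does not describe how VTCL or the retrieval evaluation work internally. As a generic illustration only, the sketch below shows what a symmetric contrastive objective over paired video and text embeddings commonly looks like, together with a Recall@K check for text-to-video retrieval. Every name here (`symmetric_contrastive_loss`, `recall_at_k`, the `temperature` value, the toy embeddings) is an assumption made for illustration and is not taken from the USV paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_matrix(video_embs, text_embs):
    """Cosine similarity: rows index videos, columns index texts."""
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    return [[sum(a * b for a, b in zip(vi, tj)) for tj in t] for vi in v]


def _diagonal_cross_entropy(logits):
    """Mean of -log softmax(row)[i], treating entry i as the correct match."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Average of video-to-text and text-to-video InfoNCE-style losses.

    Assumes video_embs[i] and text_embs[i] form a matched pair.
    The temperature default is an illustrative choice, not a reported value.
    """
    sim = similarity_matrix(video_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for text-to-video
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))


def recall_at_k(video_embs, text_embs, k=1):
    """Fraction of texts whose matched video ranks in the top k by similarity."""
    sim = similarity_matrix(video_embs, text_embs)
    num_videos = len(video_embs)
    hits = 0
    for j in range(len(text_embs)):
        ranked = sorted(range(num_videos), key=lambda i: sim[i][j], reverse=True)
        if j in ranked[:k]:
            hits += 1
    return hits / len(text_embs)


# Toy embeddings (hypothetical): two videos and their matching captions.
videos = [[1.0, 0.0], [0.0, 1.0]]
texts = [[0.9, 0.1], [0.1, 0.9]]
loss = symmetric_contrastive_loss(videos, texts)
recall = recall_at_k(videos, texts, k=1)
```

On well-aligned pairs the loss is close to zero and Recall@1 is high; swapping the captions raises the loss and drops the recall, which is the behavior a contrastive objective is meant to reward.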

Key facts

The USV dataset contains around 224K videos from UGC platforms.
Videos were collected through label queries, without manual verification or trimming.
Two tasks are defined: topic recognition and video-text retrieval.
Baseline methods for topic recognition: MMF-Net and VTCL.
Existing video understanding work focuses on instance-level recognition.
The goal is high-level semantic video understanding.
The dataset is published on arXiv under ID 2605.20838.
arXiv announce type: cross (cross-listed).
