USV Dataset for User-Generated Short-Form Video Understanding
The newly introduced USV (User-generated Short-form Video) dataset targets high-level semantic understanding of short-form videos. It contains around 224K videos collected from user-generated content (UGC) platforms through label queries, with no manual verification or trimming. Two tasks are defined on it: topic recognition and video-text retrieval. For the topic recognition task, two baseline methods are introduced: the Multi-Modality Fusion Network (MMF-Net) and Video-Text Contrastive Learning (VTCL). The study argues that current approaches to video understanding mostly target instance-level recognition, which falls short of capturing high-level semantics.
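Video-text retrieval is commonly framed as ranking videos by how close their embeddings are to the embedding of a text query, and scored with recall@k. The sketch below illustrates that general setup only; it is not the method from the study. The toy vectors and the function names `cosine`, `rank_videos`, and `recall_at_k` are assumptions made for illustration, and real embeddings would come from trained video and text encoders.

```python
import math

# Hypothetical toy embeddings: texts[i] is the caption paired with videos[i].
videos = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
texts = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_videos(text_vec, video_vecs):
    """Return video indices sorted from most to least similar to the text."""
    scores = [cosine(text_vec, v) for v in video_vecs]
    return sorted(range(len(video_vecs)), key=lambda i: scores[i], reverse=True)


def recall_at_k(text_vecs, video_vecs, k):
    """Fraction of text queries whose paired video appears in the top k."""
    hits = 0
    for i, text_vec in enumerate(text_vecs):
        if i in rank_videos(text_vec, video_vecs)[:k]:
            hits += 1
    return hits / len(text_vecs)


if __name__ == "__main__":
    print(rank_videos(texts[0], videos))
    print(recall_at_k(texts, videos, k=1))
    print(recall_at_k(texts, videos, k=2))
```

In this toy data the third caption sits closer to the second video than to its own, so its paired video is found only once k reaches 2.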
Key facts
- USV dataset contains around 224K videos from UGC platforms.
- Videos collected by label queries without manual verification or trimming.
- Two tasks: topic recognition and video-text retrieval.
- Baseline methods: MMF-Net and VTCL for topic recognition.
- Existing video understanding focuses on instance-level recognition.
- High-level semantic video understanding is the goal.
- Paper introducing the dataset posted on arXiv with ID 2605.20838.
- Announce type: cross.
Entities
Institutions
- arXiv