ClipTBP: Clip-Pair Temporal Boundary Prediction for Video Moment Retrieval
ClipTBP (Clip-Pair Temporal Boundary Prediction) is a newly proposed framework for video moment retrieval, the task of locating the video segments that match a text query. Existing models struggle to separate the target moment from visually similar segments in the surrounding context, and they ignore relationships between multiple answer segments for a single query. ClipTBP addresses these issues with clip-level alignment and boundary-aware learning: it computes similarity over clip pairs rather than over individual snippets, which improves its ability to exclude irrelevant segments. The framework is detailed in arXiv paper 2604.27591.
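To make the clip-pair idea concrete, here is a minimal sketch, not the paper's implementation, contrasting snippet-level scoring with joint scoring of (start, end) clip pairs. The feature shapes, cosine matching, and the averaging of the two boundary clips are illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def snippet_scores(clip_feats, query_feat):
    """Baseline: score each clip independently against the text query."""
    return np.array([cosine(c, query_feat) for c in clip_feats])

def clip_pair_scores(clip_feats, query_feat):
    """Score every (start, end) clip pair jointly: the two boundary clips are
    fused (here by simple averaging, an assumption) before matching the query,
    so a candidate segment is judged as a whole rather than clip by clip."""
    T = len(clip_feats)
    scores = np.full((T, T), -np.inf)
    for i in range(T):
        for j in range(i, T):
            pair_feat = (clip_feats[i] + clip_feats[j]) / 2.0
            scores[i, j] = cosine(pair_feat, query_feat)
    return scores  # scores[i, j] = relevance of the segment spanning clips i..j

# Toy usage: 8 clips with 16-dim features and a random query embedding.
rng = np.random.default_rng(0)
clips = rng.normal(size=(8, 16))
query = rng.normal(size=16)
print("best single clip:", int(np.argmax(snippet_scores(clips, query))))
pair = clip_pair_scores(clips, query)
start, end = np.unravel_index(np.argmax(pair), pair.shape)
print("best clip-pair segment:", (int(start), int(end)))
```

Scoring pairs rather than single clips lets a segment whose interior clips resemble the surrounding context still be ranked by how well its boundaries, taken together, match the query.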
Key facts
- ClipTBP is a clip-pair temporal boundary prediction framework.
- It uses boundary-aware learning for video moment retrieval (see the boundary-head sketch after this list).
- Existing models calculate similarity at the snippet level and ignore relationships between multiple answer segments.
- ClipTBP introduces clip-level alignment.
- The method addresses confusion caused by visually similar segments in the surrounding context.
- The paper is available on arXiv with ID 2604.27591.
- The arXiv announcement type is cross (cross-listed).
- Video moment retrieval matches video segments to text queries.
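Because the framework predicts temporal boundaries, the following rough sketch shows what a boundary prediction head could look like. The module name ClipBoundaryHead, the per-clip start/end classifiers, and the pair-selection decoding are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ClipBoundaryHead(nn.Module):
    """Illustrative head that scores each clip as a possible moment start or end."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.start_head = nn.Linear(feat_dim, 1)
        self.end_head = nn.Linear(feat_dim, 1)

    def forward(self, clip_feats):
        # clip_feats: (batch, num_clips, feat_dim)
        start_logits = self.start_head(clip_feats).squeeze(-1)  # (batch, num_clips)
        end_logits = self.end_head(clip_feats).squeeze(-1)      # (batch, num_clips)
        return start_logits, end_logits

# Toy usage: decode the highest-scoring (start, end) pair with start <= end.
feats = torch.randn(1, 8, 256)                       # one video, 8 clips
start_logits, end_logits = ClipBoundaryHead()(feats)
pair_scores = start_logits.unsqueeze(-1) + end_logits.unsqueeze(-2)  # (1, 8, 8)
valid = torch.triu(torch.ones(8, 8, dtype=torch.bool))               # keep start <= end
pair_scores = pair_scores.masked_fill(~valid, float("-inf"))
best = torch.argmax(pair_scores.flatten(-2, -1), dim=-1)
start, end = divmod(int(best), 8)
print("predicted boundary clips:", start, end)
```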
Entities
Institutions
- arXiv