Weakly-Supervised Video Grounding as a Game
Researchers propose a game-theoretic approach to weakly-supervised video temporal grounding. Current frameworks select among moment proposals using contrastive learning and reconstruction, but they suffer from two issues: cross-modal learning that is only coarse-grained, and dependence on complex moment proposals. The new method instead models fine-grained consistency between video frames and query words and removes the need for predefined proposals. The authors describe it as the first attempt to frame the task as a game, with the aim of improving grounding accuracy without costly proposal generation.
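To make the idea of frame-to-word alignment without predefined proposals concrete, here is a minimal sketch in plain Python. It is a generic illustration, not the paper's game-theoretic formulation, which this summary does not detail. The function names, the toy feature vectors, the max-over-words scoring rule, and the 0.5 threshold are all assumptions chosen for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def frame_word_similarity(frames, words):
    """Return a (num_frames x num_words) matrix of cosine similarities."""
    return [[cosine(f, w) for w in words] for f in frames]

def frame_scores(sim):
    """Score each frame by its best-matching query word (an assumed scoring rule)."""
    return [max(row) for row in sim]

def longest_span_above(scores, threshold):
    """Return (start, end), inclusive, of the longest run of frames scoring >= threshold.

    Returns None if no frame reaches the threshold. Ties keep the earliest run.
    """
    best = None
    start = None
    # A sentinel below any real score closes a run that reaches the last frame.
    for i, s in enumerate(scores + [float("-inf")]):
        if s >= threshold:
            if start is None:
                start = i
        elif start is not None:
            if best is None or (i - start) > (best[1] - best[0] + 1):
                best = (start, i - 1)
            start = None
    return best

# Hypothetical toy features: six frames and one query word in a 2-D embedding space.
frames = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9], [0, 1], [1, 0]]
words = [[0, 1]]

sim = frame_word_similarity(frames, words)
print(longest_span_above(frame_scores(sim), 0.5))  # (2, 4)
```

The predicted segment is read directly from per-frame scores, so no candidate moments have to be enumerated beforehand. A real system would use learned frame and word embeddings in place of the toy vectors.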
Key facts
- Task: weakly-supervised video temporal grounding
- Existing methods use moment proposal selection with contrastive learning and reconstruction
- Two issues identified: coarse-grained cross-modal learning and complex moment proposals
- Proposed method: frames the task as a game, which the authors describe as a first
- Aims to capture detailed consistency between video frames and query words
- Eliminates reliance on predefined moment proposals
- Source: arXiv preprint 2605.26441
- Published on arXiv
Entities
Institutions
- arXiv