Weakly-Supervised Video Grounding as a Game

other · 2026-05-27

Researchers propose a game-theoretic approach to weakly-supervised video temporal grounding. Existing frameworks select among moment proposals using contrastive learning and reconstruction, but they learn cross-modal correspondence only at a coarse granularity and depend on complex moment proposals. The new method instead models fine-grained alignment between individual video frames and query words, and it removes the need for predefined proposals. The authors describe it as the first attempt to frame the task as a game, with the aim of improving grounding accuracy without costly proposal generation.

Key facts

Task: weakly-supervised video temporal grounding
Existing methods use moment proposal selection with contrastive learning and reconstruction
Two issues identified: coarse-grained cross-modal learning and complex moment proposals
Proposed method: frames the task as a game, described as the first attempt to do so
Aims to capture detailed consistency between video frames and query words
Eliminates reliance on predefined moment proposals
Source: arXiv preprint 2605.26441
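
To make "fine-grained frame-to-word alignment without predefined proposals" concrete, here is a minimal Python sketch. It is not the paper's game formulation, which this summary does not detail. It assumes frame and word embeddings already live in a shared space, scores each frame by its mean cosine similarity to the query words, and picks the best contiguous span with a maximum-subarray scan instead of ranking a fixed set of proposals. All function names and the `threshold` parameter are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def frame_word_similarity(frames, words):
    """Fine-grained alignment matrix: one row per frame, one column per query word."""
    return [[cosine(f, w) for w in words] for f in frames]


def frame_relevance(sim):
    """Collapse the matrix to one score per frame (mean over query words)."""
    return [sum(row) / len(row) for row in sim]


def best_segment(scores, threshold):
    """Proposal-free span selection (an illustrative assumption, not the paper's method).

    Returns the inclusive (start, end) frame indices of the contiguous span
    maximizing sum(score - threshold), found with Kadane's algorithm.
    """
    best_sum, best_span = float("-inf"), (0, 0)
    cur_sum, cur_start = 0.0, 0
    for i, s in enumerate(scores):
        if cur_sum <= 0:
            # A non-positive running sum cannot help; restart the span here.
            cur_sum, cur_start = 0.0, i
        cur_sum += s - threshold
        if cur_sum > best_sum:
            best_sum, best_span = cur_sum, (cur_start, i)
    return best_span


# Toy example: frames 1 and 2 match the query words, frames 0 and 3 do not.
frames = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
words = [[0.0, 1.0], [0.0, 1.0]]
scores = frame_relevance(frame_word_similarity(frames, words))
print(best_segment(scores, threshold=0.5))  # (1, 2)
```

The span is recovered directly from per-frame scores, so no candidate moments need to be enumerated in advance. A real system would learn the embeddings and the scoring rule under weak supervision rather than fix them as done here.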
