Grounding Temporal Consistency in Video Object-Centric Learning via Correspondence
Grounded Correspondence is a framework for video object-centric learning that replaces learned temporal prediction modules with deterministic bipartite matching. It relies on instance-discriminative features from frozen self-supervised vision backbones to maintain object identity across frames. Slots are initialized from salient regions in the backbone features, and Hungarian matching carries each slot's identity from one frame to the next, so temporal modeling requires zero learnable parameters. The method reports competitive performance on the MOVi-D, MOVi-E, and YouTube-VIS datasets. The paper is available on arXiv (2605.03650), with a project page at https://magent.
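To make the matching step concrete, the following is a minimal sketch of frame-to-frame slot matching with the Hungarian algorithm, written in plain Python. It is not the authors' implementation: the cosine-distance cost, the representation of slots as plain feature vectors, and the names `cosine_distance`, `hungarian`, and `match_slots` are all assumptions made for illustration, since this summary does not state the cost function used in the paper.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two feature vectors (assumed cost)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def hungarian(cost):
    """Minimum-cost assignment for an n x m cost matrix with n <= m.

    Returns a list where entry i is the column assigned to row i.
    Classic O(n^2 * m) potentials-based formulation, 1-indexed internally.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)      # row potentials
    v = [0.0] * (m + 1)      # column potentials
    p = [0] * (m + 1)        # p[j] = row currently matched to column j
    way = [0] * (m + 1)      # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta = inf
            j1 = 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta = minv[j]
                        j1 = j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:   # reached a free column: augmenting path found
                break
        while True:          # flip the matching along the path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
            if j0 == 0:
                break
    assignment = [0] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    return assignment


def match_slots(prev_slots, curr_slots):
    """For each previous-frame slot, return the index of its current-frame match."""
    cost = [[cosine_distance(a, b) for b in curr_slots] for a in prev_slots]
    return hungarian(cost)


# Toy example: the current frame holds the same three objects in shuffled order.
prev_slots = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
curr_slots = [[0.0, 0.9, 0.1], [0.1, 0.0, 0.9], [0.9, 0.1, 0.0]]
order = match_slots(prev_slots, curr_slots)
print(order)  # [2, 0, 1]
```

Reordering the current slots by `order` puts each object back in the position it held in the previous frame, which is how identity can be carried through a video without any trained temporal module. In practice a library routine such as SciPy's `linear_sum_assignment` would typically replace the hand-written solver.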
Key facts
- Grounded Correspondence replaces learned temporal prediction with bipartite matching
- Uses frozen self-supervised vision backbones for instance-discriminative features
- Slots are initialized from salient regions in backbone features
- Hungarian matching maintains frame-to-frame identity
- Zero learnable parameters for temporal modeling
- Competitive performance on MOVi-D, MOVi-E, and YouTube-VIS
- Paper on arXiv: 2605.03650
- Project page: https://magent
Entities
Institutions
- arXiv