Grounding Temporal Consistency in Video Object-Centric Learning via Correspondence

other · 2026-05-12

A new framework called Grounded Correspondence replaces the learned temporal prediction modules used in video object-centric learning with deterministic bipartite matching. The approach relies on instance-discriminative features from frozen self-supervised vision backbones to maintain object identity across frames. Slots are initialized from salient regions, and Hungarian matching assigns frame-to-frame identity, so the method needs zero learnable parameters for temporal modeling. It achieves competitive performance on the MOVi-D, MOVi-E, and YouTube-VIS datasets. The paper is available on arXiv, with a project page at https://magent.
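To make the matching step concrete, here is a minimal sketch of frame-to-frame identity assignment with the Hungarian algorithm. It is not the paper's implementation: the function name match_slots, the use of cosine distance as the matching cost, and the random vectors standing in for backbone-derived slot features are all assumptions made for illustration. The assignment itself is solved with SciPy's linear_sum_assignment.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_slots(prev_slots, curr_slots):
    """Reorder curr_slots so that row i continues the identity of prev_slots[i].

    Hypothetical helper: both inputs are (K, D) arrays of slot features.
    The cosine-distance cost is an assumption, not taken from the paper.
    """
    # L2-normalise so that the dot product equals cosine similarity.
    prev = prev_slots / np.linalg.norm(prev_slots, axis=1, keepdims=True)
    curr = curr_slots / np.linalg.norm(curr_slots, axis=1, keepdims=True)

    # cost[i, j] is low when previous slot i resembles current slot j.
    cost = 1.0 - prev @ curr.T

    # Minimum-cost one-to-one assignment. For a square cost matrix the
    # returned row indices are 0..K-1, so col_idx is the permutation.
    _, col_idx = linear_sum_assignment(cost)
    return curr_slots[col_idx], col_idx


# Toy check: shuffle the slots of one "frame", add a little noise,
# and recover the original ordering.
rng = np.random.default_rng(0)
prev = rng.normal(size=(4, 16))
shuffle = np.array([2, 0, 3, 1])
curr = prev[shuffle] + 0.01 * rng.normal(size=(4, 16))

aligned, perm = match_slots(prev, curr)
```

No parameters are trained anywhere in this step: the matching is a deterministic function of the two feature sets, which is the property the summary describes as zero learnable parameters for temporal modeling.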

Key facts

Grounded Correspondence replaces learned temporal prediction with bipartite matching
Uses frozen self-supervised vision backbones for instance-discriminative features
Slots initialize from salient regions in backbone features
Hungarian matching maintains frame-to-frame identity
Zero learnable parameters for temporal modeling
Competitive performance on MOVi-D, MOVi-E, and YouTube-VIS
Paper on arXiv: 2605.03650
Project page: https://magent
