ARTFEED — Contemporary Art Intelligence

Grounding Temporal Consistency in Video Object-Centric Learning via Correspondence

other · 2026-05-12

A new framework called Grounded Correspondence replaces the learned temporal prediction modules used in video object-centric learning with deterministic bipartite matching. It relies on instance-discriminative features from frozen self-supervised vision backbones to maintain object identity across frames: slots are initialized from salient regions in the backbone features, and Hungarian matching links each slot to its counterpart in the next frame. As a result, temporal modeling requires zero learnable parameters. The method achieves competitive performance on the MOVi-D, MOVi-E, and YouTube-VIS datasets. The paper is available on arXiv (2605.03650), with a project page at https://magent.
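To make the matching step concrete, here is a minimal sketch of frame-to-frame slot assignment with the Hungarian algorithm. It is an illustration under stated assumptions, not the authors' implementation: the cosine-distance cost, the toy three-dimensional slot features, and the function names (`cosine_similarity`, `hungarian`, `match_slots`) are all hypothetical, since the summary does not say how slot similarity is measured. The solver is written in plain Python so the example is self-contained; SciPy's `scipy.optimize.linear_sum_assignment` solves the same assignment problem.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def hungarian(cost):
    """Minimum-cost assignment for an n x m cost matrix with n <= m.

    Returns a list whose entry i is the column assigned to row i.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)    # row potentials (1-based)
    v = [0.0] * (m + 1)    # column potentials (1-based)
    match = [0] * (m + 1)  # match[j] = row assigned to column j, 0 if free
    way = [0] * (m + 1)    # back-pointers along the augmenting path
    for i in range(1, n + 1):
        match[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = match[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[match[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if match[j0] == 0:  # reached a free column: path is complete
                break
        while j0:  # flip the matching along the augmenting path
            j1 = way[j0]
            match[j0] = match[j1]
            j0 = j1
    assignment = [-1] * n
    for j in range(1, m + 1):
        if match[j]:
            assignment[match[j] - 1] = j - 1
    return assignment


def match_slots(prev_slots, curr_slots):
    """For each previous-frame slot, return the index of its current-frame match."""
    # Assumed cost: cosine distance between slot feature vectors.
    cost = [[1.0 - cosine_similarity(p, c) for c in curr_slots] for p in prev_slots]
    return hungarian(cost)


# Toy data: the current frame holds the same three objects in a different order.
prev_slots = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
curr_slots = [[0.1, 0.0, 0.9], [0.9, 0.1, 0.0], [0.0, 0.8, 0.2]]

assignment = match_slots(prev_slots, curr_slots)
# Reorder current slots so that index i keeps the identity of previous slot i.
tracked = [curr_slots[j] for j in assignment]
print(assignment)  # [1, 2, 0]
```

Nothing in this step is trained: the assignment depends only on the features and the chosen cost, which is consistent with the claim that temporal modeling needs zero learnable parameters.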

Key facts

  • Grounded Correspondence replaces learned temporal prediction with bipartite matching
  • Uses frozen self-supervised vision backbones for instance-discriminative features
  • Slots are initialized from salient regions in backbone features
  • Hungarian matching maintains frame-to-frame identity
  • Zero learnable parameters for temporal modeling
  • Competitive performance on MOVi-D, MOVi-E, and YouTube-VIS
  • Paper on arXiv: 2605.03650
  • Project page: https://magent

Entities

Institutions

  • arXiv
