ARTFEED — Contemporary Art Intelligence

Motion-Centric Self-Supervised Video Representation Learning

publication · 2026-05-25

A new arXiv paper (2605.23045) proposes a self-supervised method for video representation learning that treats motion as the central modality. The approach represents motion with point-tracks and trains a masked autoencoder to reconstruct tracks that have been hidden from the model. The authors position it as an alternative to two common strategies: scaling up video models, which is costly, and language-supervised learning, which restricts the learned concepts to those that appear in captions. They argue that current video models still struggle with temporal understanding, and the motion-focused technique is aimed at that gap. According to the paper, learning from motion alone can produce effective video representations without relying on language or large-scale datasets.
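To make the masked-reconstruction idea concrete, here is a minimal sketch of the data flow in plain Python. It is not the paper's implementation: the synthetic tracks, the 0.75 mask ratio, the function names, and the mean-of-visible-tracks "predictor" are all assumptions chosen for illustration, with the predictor standing in for a learned encoder and decoder. The sketch only shows the objective described above: hide some point-tracks, predict them from the visible ones, and score the prediction on the hidden tracks alone.

```python
import random


def make_tracks(num_tracks, num_frames, seed=0):
    """Synthetic point-tracks (assumption): each track is a list of
    (x, y) positions, one per frame, moving at constant velocity."""
    rng = random.Random(seed)
    tracks = []
    for _ in range(num_tracks):
        x, y = rng.random(), rng.random()
        vx, vy = rng.uniform(-0.01, 0.01), rng.uniform(-0.01, 0.01)
        tracks.append([(x + vx * t, y + vy * t) for t in range(num_frames)])
    return tracks


def split_masked(num_tracks, mask_ratio, seed=0):
    """Randomly choose which tracks are hidden from the model.
    Returns (visible_indices, masked_indices); at least one track stays visible."""
    rng = random.Random(seed)
    indices = list(range(num_tracks))
    rng.shuffle(indices)
    num_masked = min(num_tracks - 1, int(round(num_tracks * mask_ratio)))
    return sorted(indices[num_masked:]), sorted(indices[:num_masked])


def mean_track_predictor(tracks, visible_idx, masked_idx):
    """Placeholder for a learned model (assumption): predicts every masked
    track as the per-frame mean of the visible tracks."""
    num_frames = len(tracks[0])
    mean_track = []
    for t in range(num_frames):
        mx = sum(tracks[i][t][0] for i in visible_idx) / len(visible_idx)
        my = sum(tracks[i][t][1] for i in visible_idx) / len(visible_idx)
        mean_track.append((mx, my))
    return {i: mean_track for i in masked_idx}


def reconstruction_loss(tracks, predictions):
    """Mean squared error per point, computed over the masked tracks only."""
    total, count = 0.0, 0
    for i, predicted in predictions.items():
        for (px, py), (tx, ty) in zip(predicted, tracks[i]):
            total += (px - tx) ** 2 + (py - ty) ** 2
            count += 1
    return total / count


# Hypothetical sizes: 16 tracks over 8 frames, 75% of tracks hidden.
tracks = make_tracks(num_tracks=16, num_frames=8)
visible, masked = split_masked(num_tracks=16, mask_ratio=0.75)
predictions = mean_track_predictor(tracks, visible, masked)
loss = reconstruction_loss(tracks, predictions)
print(f"visible={len(visible)} masked={len(masked)} loss={loss:.4f}")
```

In a real system the placeholder predictor would be replaced by a trainable network, and this loss would be minimized by gradient descent. The summary does not specify the architecture or the masking schedule, so those details are left open here.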

Key facts

  • Paper ID: arXiv:2605.23045
  • Title: The TIME Machine: On The Power of Motion for Efficient Perception
  • Proposes motion as the central modality for video representation learning
  • Uses point-tracks to represent motion in video
  • Employs a masked autoencoder to reconstruct missing tracks
  • Self-supervised learning method
  • Aims to overcome limitations of scaling and language-supervised learning
  • Focuses on improving temporal understanding in video models

Entities

Institutions

  • arXiv

Sources