Motion-Centric Self-Supervised Video Representation Learning

publication · 2026-05-25

A new arXiv paper (2605.23045) proposes a self-supervised method for video representation learning that treats motion as the central modality. It represents motion with point tracks and trains a masked autoencoder to reconstruct tracks that are missing from the input. The authors position this as an alternative to two routes they consider costly or limiting: scaling up video models, and language-supervised learning, which restricts the learned concepts to those that appear in captions. They argue that current video models still struggle with temporal understanding, and the motion-focused technique is aimed at that gap. According to the paper, learning from motion alone can produce effective video representations without relying on language or large-scale datasets.
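To make the idea of reconstructing missing point tracks concrete, here is a minimal sketch of the masking-and-reconstruction objective in plain Python. It is not the paper's implementation: the function names, the 75% mask ratio, the toy translating tracks, the mean-of-visible-tracks predictor (a stand-in for a learned decoder), and the squared-error loss are all assumptions chosen for illustration, since the summary above does not specify them.

```python
import random


def mask_tracks(num_tracks, mask_ratio, rng):
    """Randomly split track indices into visible and masked sets (assumed scheme)."""
    num_masked = int(round(num_tracks * mask_ratio))
    if num_masked >= num_tracks:
        raise ValueError("at least one track must stay visible")
    order = list(range(num_tracks))
    rng.shuffle(order)
    return sorted(order[num_masked:]), sorted(order[:num_masked])


def mean_track_predictor(tracks, visible, masked):
    """Hypothetical stand-in for a learned decoder.

    Predicts every masked track as the per-frame mean of the visible tracks.
    """
    num_frames = len(tracks[0])
    mean_track = []
    for f in range(num_frames):
        mean_x = sum(tracks[i][f][0] for i in visible) / len(visible)
        mean_y = sum(tracks[i][f][1] for i in visible) / len(visible)
        mean_track.append((mean_x, mean_y))
    return {i: mean_track for i in masked}


def reconstruction_loss(tracks, predictions):
    """Mean squared error, computed over the masked tracks only."""
    total, count = 0.0, 0
    for i, predicted in predictions.items():
        for (px, py), (x, y) in zip(predicted, tracks[i]):
            total += (px - x) ** 2 + (py - y) ** 2
            count += 1
    return total / count


rng = random.Random(0)

# Toy data: 8 point tracks over 5 frames, all moving by (1.0, 0.5) per frame.
starts = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(8)]
tracks = [[(x0 + f * 1.0, y0 + f * 0.5) for f in range(5)] for x0, y0 in starts]

visible, masked = mask_tracks(len(tracks), mask_ratio=0.75, rng=rng)
predictions = mean_track_predictor(tracks, visible, masked)
loss = reconstruction_loss(tracks, predictions)
print(f"visible={visible} masked={masked} loss={loss:.3f}")
```

In a real system the mean predictor would be replaced by a trained encoder-decoder, and the loss on the masked tracks would drive its training; the sketch only shows how the data is split and where the error is measured.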

Key facts

Paper ID: arXiv:2605.23045
Title: The TIME Machine: On The Power of Motion for Efficient Perception
Proposes motion as the central modality for video representation learning
Uses point tracks to represent motion in video
Employs a masked autoencoder to reconstruct missing tracks
Self-supervised learning method
Aims to overcome limitations of scaling and language-supervised learning
Focuses on improving temporal understanding in video models
