Motion-Centric Self-Supervised Video Representation Learning
A new arXiv paper (2605.23045) proposes a self-supervised video representation learning method that treats motion as the central modality. The approach represents motion with point-tracks and trains a masked-autoencoder to reconstruct tracks that have been hidden from the model. It is positioned as an alternative to two existing routes: scaling video models, which is costly, and language-supervised learning, which restricts the learned concepts to those that appear in captions. The authors argue that current video models still struggle with temporal understanding, and their motion-focused technique aims to close this gap. The paper reports that learning from motion alone can produce effective video representations without relying on language or large-scale datasets.
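To make the masked-reconstruction idea concrete, here is a minimal sketch in Python with NumPy. It is not the paper's implementation. It assumes that point-tracks are stored as an array of shape (num_tracks, num_frames, 2) holding x, y positions, that whole tracks are masked at random, and that the loss is a mean squared error over the masked tracks only. The function names, the 0.75 mask ratio, and the averaging predictor that stands in for a learned encoder-decoder are all hypothetical choices made for illustration.

```python
import numpy as np


def split_tracks(num_tracks, mask_ratio=0.75, rng=None):
    """Randomly split track indices into masked and visible sets."""
    rng = np.random.default_rng() if rng is None else rng
    # Always keep at least one visible track to condition on.
    num_masked = min(num_tracks - 1, int(round(num_tracks * mask_ratio)))
    perm = rng.permutation(num_tracks)
    return np.sort(perm[:num_masked]), np.sort(perm[num_masked:])


def predict_masked(tracks, masked_idx, visible_idx):
    """Stand-in for a learned encoder-decoder (an assumption, not the paper's model).

    Each masked track is predicted as its own starting point plus the mean
    displacement of the visible tracks over time.
    """
    visible = tracks[visible_idx]                         # (V, T, 2)
    mean_disp = (visible - visible[:, :1]).mean(axis=0)   # (T, 2)
    starts = tracks[masked_idx, :1]                       # (M, 1, 2)
    return starts + mean_disp[None]                       # (M, T, 2)


def masked_mse(pred, tracks, masked_idx):
    """Mean squared error computed only on the masked tracks."""
    return float(np.mean((pred - tracks[masked_idx]) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic scene: 8 points that all share one camera-like translation.
    starts = rng.uniform(0, 100, size=(8, 1, 2))
    motion = np.cumsum(rng.normal(size=(16, 2)), axis=0)
    motion -= motion[0]
    tracks = starts + motion[None]                        # (8, 16, 2)

    masked_idx, visible_idx = split_tracks(len(tracks), rng=rng)
    pred = predict_masked(tracks, masked_idx, visible_idx)
    print("masked tracks:", masked_idx)
    print("reconstruction MSE:", masked_mse(pred, tracks, masked_idx))
```

In this synthetic scene every point shares the same motion, so the averaging predictor recovers the hidden tracks almost exactly. Real video contains independently moving objects, and that is the case where a learned model would be needed in place of the stand-in.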
Key facts
- Paper ID: arXiv:2605.23045
- Title: The TIME Machine: On The Power of Motion for Efficient Perception
- Proposes motion as central modality for video representation
- Uses point-tracks to represent motion in video
- Employs a masked-autoencoder to reconstruct missing tracks
- Self-supervised learning method
- Aims to overcome limitations of scaling and language-supervised learning
- Focuses on improving temporal understanding in video models
Entities
Institutions
- arXiv