RayDer: A Unified Transformer for Scalable Self-Supervised Novel View Synthesis
Researchers have introduced RayDer, a unified feed-forward transformer that integrates camera estimation, scene reconstruction, and rendering into a single framework for self-supervised novel view synthesis (NVS) from real-world video. A minimal dynamic state absorbs time-varying content, which keeps training stable on unconstrained video while static-scene NVS remains the primary objective. Across model sizes, RayDer shows clean power-law scaling with both data and compute, and it outperforms earlier static-scene NVS approaches. The work addresses two obstacles: training on realistic video, and the unpredictable scaling of multi-network systems. It recasts self-supervised NVS as a single-model scaling problem. The paper is available on arXiv as 2605.31535.
Key facts
- RayDer is a unified feed-forward transformer for novel view synthesis.
- It consolidates camera estimation, scene reconstruction, and rendering into one backbone.
- A minimal dynamic state absorbs time-varying content for stable training.
- Dynamic content is used only as scalable supervision, not reconstructed.
- RayDer exhibits clean power-law scaling with data and compute.
- It outperforms static-scene NVS methods.
- The paper is on arXiv: 2605.31535.
- Self-supervised NVS from real-world video is the focus.
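
To make the "clean power-law scaling" claim concrete: a power law means a metric such as loss follows roughly loss ≈ a · compute^(−b), which is a straight line on log-log axes. The sketch below shows one common way to check this, a least-squares line fit in log-log space. It is an illustration only: the function name `fit_power_law`, the compute values, and the constants are hypothetical and are not taken from the RayDer paper.

```python
import numpy as np

def fit_power_law(x, y):
    """Fit y ≈ a * x**(-b) via a straight-line fit in log-log space.

    Returns (a, b). Assumes x and y are strictly positive.
    """
    slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)
    return float(np.exp(intercept)), float(-slope)

# Synthetic, illustrative numbers (NOT results from the RayDer paper).
compute = np.array([1e18, 1e19, 1e20, 1e21])  # hypothetical training FLOPs
true_a, true_b = 50.0, 0.1                    # hypothetical ground-truth law
loss = true_a * compute ** (-true_b)          # noise-free power-law data

a, b = fit_power_law(compute, loss)
print(f"fitted a={a:.3f}, b={b:.3f}")
```

On real measurements the points would carry noise, and how closely they follow the fitted line is what indicates whether the scaling is "clean".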
Entities
Institutions
- arXiv