RayDer: A Unified Transformer for Scalable Self-Supervised Novel View Synthesis

ai-technology · 2026-06-01

Researchers have developed RayDer, a unified feed-forward transformer that integrates camera estimation, scene reconstruction, and rendering into a single backbone for self-supervised novel view synthesis (NVS) from real-world video. A minimal dynamic state absorbs time-varying content, which keeps training stable on unconstrained video while static-scene NVS remains the primary objective; dynamic content serves only as supervision and is not itself reconstructed. Across model sizes, RayDer shows clean power-law scaling with both data and compute, and it outperforms earlier static-scene NVS methods. The work addresses two obstacles: training on realistic video and the unpredictable scaling of multi-network systems. In doing so, it recasts self-supervised NVS as a single-model scaling problem. The paper is available on arXiv as 2605.31535.
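For readers unfamiliar with the term, power-law scaling means that a quality metric such as loss falls roughly as a straight line on a log-log plot against data or compute. The sketch below is illustrative only: it assumes the generic form loss = a * compute^(-b), which this summary does not confirm as the paper's exact parameterization, and the function name `fit_power_law` and all data points are hypothetical, not results from the paper.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by ordinary least squares in log-log space.

    Returns (a, b). Assumes all xs and ys are strictly positive.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    slope = sxy / sxx                    # slope of the log-log line, equals -b
    intercept = mean_y - slope * mean_x  # equals log(a)
    return math.exp(intercept), -slope

# Synthetic, made-up points generated from a known power law
# (NOT measurements from the RayDer paper).
compute = [1e18, 1e19, 1e20, 1e21]
loss = [2.0 * c ** -0.05 for c in compute]

a, b = fit_power_law(compute, loss)
print(f"a = {a:.3f}, b = {b:.3f}")
```

Because the synthetic points lie exactly on a power law, the fit recovers the coefficients used to generate them. With real measurements, the residuals around the fitted line indicate how cleanly a model family follows the trend.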

Key facts

RayDer is a unified feed-forward transformer for novel view synthesis.
It consolidates camera estimation, scene reconstruction, and rendering into one backbone.
A minimal dynamic state absorbs time-varying content for stable training.
Dynamic content is used only as scalable supervision, not reconstructed.
RayDer exhibits clean power-law scaling with data and compute.
It outperforms static-scene NVS methods.
The paper is on arXiv: 2605.31535.
Self-supervised NVS from real-world video is the focus.
