Rays as Pixels: Joint Video and Camera Trajectory Diffusion Model

ai-technology · 2026-04-24

Researchers have introduced Rays as Pixels, a video diffusion model (VDM) that learns a joint distribution over videos and camera trajectories. It is described as the first model to both predict camera poses and support camera-controlled video generation within a single framework. Each camera is encoded as dense ray pixels (raxels), a pixel-aligned representation that matches the latent space of the video frames, and video and camera are denoised jointly through a Decoupled Self-Cross Attention mechanism. The model handles three tasks: predicting a camera trajectory from a video, generating a video from input images along a specified trajectory, and synthesizing video and trajectory together. The paper is available on arXiv under ID 2604.09429.
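To make the idea of a pixel-aligned camera encoding concrete, here is a minimal sketch of a per-pixel ray map for a pinhole camera. This summary does not give the paper's exact raxel parameterization, so everything here is an assumption for illustration: the function name `ray_map`, the 6-value layout (ray origin followed by unit direction), the pinhole intrinsics `fx, fy, cx, cy`, and the camera-to-world pose `R, t`.

```python
import math


def ray_map(width, height, fx, fy, cx, cy, R, t):
    """Return a height x width grid of 6-tuples (origin xyz, unit direction xyz).

    Assumed conventions (not taken from the paper):
    - pinhole camera looking down +z, pixel centers at (u + 0.5, v + 0.5)
    - R is a 3x3 camera-to-world rotation given as a list of rows
    - t is the camera center in world coordinates
    """
    rays = []
    for v in range(height):
        row = []
        for u in range(width):
            # Back-project the pixel center to a direction in camera space.
            d_cam = ((u + 0.5 - cx) / fx, (v + 0.5 - cy) / fy, 1.0)
            # Rotate the direction into world space.
            d_world = [sum(R[i][j] * d_cam[j] for j in range(3)) for i in range(3)]
            # Normalize so every pixel stores a unit direction.
            norm = math.sqrt(sum(c * c for c in d_world))
            row.append(tuple(t) + tuple(c / norm for c in d_world))
        rays.append(row)
    return rays


# Example: a 3x3 image, identity rotation, camera at the world origin.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rays = ray_map(3, 3, fx=2.0, fy=2.0, cx=1.5, cy=1.5, R=identity, t=(0.0, 0.0, 0.0))
center_ray = rays[1][1]  # the middle pixel looks straight along +z
```

Because the result has one entry per pixel, a camera encoded this way has the same spatial layout as an image, which is what lets it be treated like an extra set of channels next to the video frames. Computing one such map per frame gives a sequence that describes the whole trajectory.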

Key facts

First model to combine camera pose prediction and camera-controlled video generation in one framework.
Uses dense ray pixels (raxels) as a pixel-aligned encoding for cameras.
Employs Decoupled Self-Cross Attention mechanism for joint denoising.
Handles three tasks: trajectory prediction, video generation from images, and joint synthesis.
Published on arXiv with ID 2604.09429.
