ARTFEED — Contemporary Art Intelligence

Rays as Pixels: Joint Video and Camera Trajectory Diffusion Model

ai-technology · 2026-04-24

Researchers have introduced Rays as Pixels, a video diffusion model (VDM) that learns a joint distribution over videos and camera trajectories. It is described as the first model to both predict camera poses and support camera-controlled video generation within a single framework. Each camera is encoded as dense ray pixels (raxels), a pixel-aligned representation that shares the latent space of the video frames, and video and camera are denoised jointly through a Decoupled Self-Cross Attention mechanism. The model handles three tasks: predicting a camera trajectory from a video, generating video from input images along a specified trajectory, and synthesizing video and trajectory together. The paper is available on arXiv under ID 2604.09429.

Key facts

  • First model to combine camera pose prediction and camera-controlled video generation in one framework.
  • Uses dense ray pixels (raxels) as a pixel-aligned encoding for cameras.
  • Employs a Decoupled Self-Cross Attention mechanism for joint denoising.
  • Handles three tasks: trajectory prediction, video generation from images, and joint synthesis.
  • Published on arXiv with ID 2604.09429.
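
To make the idea of a pixel-aligned camera encoding more concrete, the sketch below turns a single pinhole camera into a dense per-pixel ray map with NumPy. It is only an illustration of the general technique: the function name `camera_to_ray_map`, the six-channel layout (ray origin plus unit direction), and the pinhole assumption are all hypothetical choices made here, and the paper's actual raxel format may differ.

```python
import numpy as np

def camera_to_ray_map(K, cam_to_world, height, width):
    """Return a (height, width, 6) map: per-pixel ray origin and unit direction.

    Hypothetical layout for illustration; not the paper's raxel definition.
    K            : (3, 3) pinhole intrinsics matrix
    cam_to_world : (4, 4) camera-to-world pose matrix
    """
    # Pixel-center coordinates, one (u, v) pair per output pixel.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)   # (H, W, 3)

    # Back-project each pixel through the inverse intrinsics (camera space).
    dirs_cam = pixels @ np.linalg.inv(K).T

    # Rotate directions into world space and normalize to unit length.
    rotation = cam_to_world[:3, :3]
    dirs_world = dirs_cam @ rotation.T
    dirs_world = dirs_world / np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # Every ray of a pinhole camera starts at the camera center.
    origins = np.broadcast_to(cam_to_world[:3, 3], dirs_world.shape)
    return np.concatenate([origins, dirs_world], axis=-1)


# Example: a 48x64 camera translated one unit along the world x-axis.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[:3, 3] = [1.0, 0.0, 0.0]
ray_map = camera_to_ray_map(K, pose, height=48, width=64)
```

Because the resulting map has the same spatial grid as an image, it can be treated like extra image channels, which is the property the summary above attributes to raxels.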

Entities

Institutions

  • arXiv

Sources