CRePE: Curved Ray Expectation Positional Encoding for Unified Camera Control in Video Generation
Curved Ray Expectation Positional Encoding (CRePE) is a technique for camera-conditioned video generation under the Unified Camera Model, which accommodates both wide-angle and fisheye lenses. Existing attention-level camera encodings either use ray signals alone or rely on pinhole geometry, which limits their applicability. CRePE instead encodes each image token as a depth-aware positional distribution along the ray that produced it, capturing the geometry of projection paths under non-pinhole cameras. The method is integrated into frozen video diffusion transformers (DiTs) through a Geometric Attention Adapter, which injects token-wise scene-distance information into selected attention layers and is stabilized with pseudo supervision. The approach is described in an arXiv paper (2605.12938) and aims to provide reliable positional encoding across variations in camera motion, lens configuration, and scene structure.
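The core idea of a depth-aware positional distribution along a ray can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: the sinusoidal encoding, the discrete depth grid, and all function names are assumptions made for the example.

```python
import numpy as np

def sinusoidal_pe(x, num_freqs=4):
    """Fourier features of 3D points: [sin(2^k x), cos(2^k x)] per axis."""
    freqs = 2.0 ** np.arange(num_freqs)                 # (F,)
    ang = x[..., None] * freqs                          # (..., 3, F)
    return np.concatenate([np.sin(ang), np.cos(ang)],
                          axis=-1).reshape(*x.shape[:-1], -1)

def crepe_encoding(origins, dirs, depths, depth_probs, num_freqs=4):
    """Hypothetical sketch: encode each image token as the expectation of a
    positional encoding over sample points along its ray.

    origins:     (T, 3) per-token ray origins
    dirs:        (T, 3) per-token ray directions (from any camera model,
                 e.g. a fisheye unprojection)
    depths:      (D,)   shared discrete depth grid
    depth_probs: (T, D) per-token distribution over depths (rows sum to 1)
    """
    # Sample points along each ray: (T, D, 3).
    pts = origins[:, None, :] + depths[None, :, None] * dirs[:, None, :]
    pe = sinusoidal_pe(pts, num_freqs)                  # (T, D, C)
    # Expected encoding under the per-token depth distribution: (T, C).
    return (depth_probs[..., None] * pe).sum(axis=1)
```

Because the expectation is taken over points on the actual ray, the encoding reflects curved projection paths whenever the camera model bends rays away from pinhole geometry.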
Key facts
- CRePE stands for Curved Ray Expectation Positional Encoding.
- It addresses limitations of existing camera encodings for video generation.
- Supports the Unified Camera Model including wide-angle and fisheye lenses.
- Represents image tokens as depth-aware positional distributions along source rays.
- Implemented via a Geometric Attention Adapter added to frozen video DiTs.
- Injects token-wise scene-distance information into selected attention layers.
- Paper published on arXiv with identifier 2605.12938.
- Aims to improve camera-conditioned video generation reliability.
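One common way an adapter injects side information into a frozen attention layer, consistent with the bullets above, is as a zero-initialized additive bias on queries and keys, so the frozen DiT's behavior is unchanged at the start of training. The class below is a hypothetical numpy sketch of that pattern, not the paper's architecture; its names and shapes are assumptions.

```python
import numpy as np

class GeometricAttentionAdapter:
    """Hypothetical sketch: map token-wise scene-distance (geometry) features
    to additive biases on the queries and keys of a frozen attention layer."""

    def __init__(self, geom_dim, model_dim):
        # Zero initialization: the adapter is a no-op before training,
        # so the frozen DiT's outputs are initially unperturbed.
        self.Wq = np.zeros((geom_dim, model_dim))
        self.Wk = np.zeros((geom_dim, model_dim))

    def __call__(self, q, k, geom):
        """q, k: (T, model_dim) attention inputs; geom: (T, geom_dim)."""
        return q + geom @ self.Wq, k + geom @ self.Wk
```

During training only the adapter weights would be updated, letting geometry steer attention in the selected layers while the pretrained DiT stays frozen.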