Latent-to-Pixel Transfer Paradigm for Efficient Pixel Diffusion Models
Latent-to-Pixel (L2P) is a transfer paradigm for efficiently training pixel-space diffusion models by reusing a pre-trained latent diffusion model (LDM). L2P discards the VAE, tokenizes raw pixels with large patches, freezes the intermediate layers of the source LDM, and trains only the shallow layers to learn the latent-to-pixel transformation. The training corpus consists solely of synthetic images sampled from the source LDM, so no real data is required and convergence is rapid. Training fits on just 8 GPUs, and removing the VAE eliminates its memory bottleneck, enabling native 4K ultra-high-resolution generation. The approach is detailed in a paper on arXiv (2605.12013).
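To make the mechanism concrete, below is a minimal PyTorch sketch of the architectural surgery the paradigm describes. All names here (an `ldm_backbone` exposing a `.blocks` module list, `pixel_patch`, `num_shallow`) are illustrative assumptions, not identifiers from the paper.

```python
import torch
import torch.nn as nn

class L2PPixelDiffusion(nn.Module):
    """Hypothetical sketch: graft a pixel-space front/back end onto a
    pre-trained LDM transformer, freezing the intermediate trunk."""

    def __init__(self, ldm_backbone: nn.Module, hidden_dim: int = 1152,
                 pixel_patch: int = 16, num_shallow: int = 2):
        super().__init__()
        # Large-patch tokenizer replaces the VAE encoder: raw RGB pixels
        # are embedded directly, e.g. one 16x16 pixel patch -> one token.
        self.patch_embed = nn.Conv2d(3, hidden_dim,
                                     kernel_size=pixel_patch,
                                     stride=pixel_patch)
        # Reuse the transformer blocks of the source LDM (assumed to be
        # an nn.ModuleList on the backbone).
        self.blocks = ldm_backbone.blocks
        # New output head predicts pixel patches instead of VAE latents.
        self.head = nn.Linear(hidden_dim, 3 * pixel_patch * pixel_patch)

        # Freeze the intermediate trunk; keep only the shallow blocks at
        # both ends trainable (one reading of "shallow layers").
        for i, block in enumerate(self.blocks):
            trainable = (i < num_shallow
                         or i >= len(self.blocks) - num_shallow)
            block.requires_grad_(trainable)

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, D)
        for block in self.blocks:
            tokens = block(tokens, t)  # assumed DiT-style block signature
        return self.head(tokens)      # (B, N, 3 * patch * patch)
```

The design point this illustrates is that the new pixel-facing modules (`patch_embed`, `head`) and a few shallow blocks are the only trainable parameters; the frozen intermediate trunk carries over the generative prior of the source LDM.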
Key facts
- L2P stands for Latent-to-Pixel transfer paradigm
- It uses pre-trained latent diffusion models (LDMs) as a source
- The VAE is discarded in favor of large-patch tokenization
- Only shallow layers are trained; intermediate layers are frozen
- Training uses synthetic images sampled from the LDM itself, with no real data needed (see the training-loop sketch after this list)
- Requires only 8 GPUs for training
- Enables native 4K ultra-high-resolution generation
- Paper available on arXiv with ID 2605.12013
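As referenced in the list above, here is a minimal sketch of one synthetic-data training step, assuming the `L2PPixelDiffusion` model from the earlier sketch and a callable `ldm_pipeline` that samples a batch of images from the source LDM. The linear noise schedule and plain noise-prediction loss are simplifications, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def patchify(x: torch.Tensor, p: int = 16) -> torch.Tensor:
    """Rearrange (B, 3, H, W) pixels into (B, N, 3*p*p) patch tokens,
    matching the layout of the model's output head."""
    b, c, h, w = x.shape
    x = x.unfold(2, p, p).unfold(3, p, p)      # (B, C, H/p, W/p, p, p)
    x = x.permute(0, 2, 3, 1, 4, 5)
    return x.reshape(b, (h // p) * (w // p), c * p * p)

def train_step(model, ldm_pipeline, prompts, optimizer, p: int = 16):
    # 1. The source LDM is the sole data source: sample a synthetic batch.
    with torch.no_grad():
        images = ldm_pipeline(prompts)          # (B, 3, H, W) in [-1, 1]

    # 2. Diffuse the raw pixels directly -- there is no VAE encode step.
    noise = torch.randn_like(images)
    t = torch.rand(images.shape[0], device=images.device)  # t in [0, 1)
    a = (1.0 - t).view(-1, 1, 1, 1)             # toy linear schedule
    noisy = a.sqrt() * images + (1.0 - a).sqrt() * noise

    # 3. Only the shallow, unfrozen layers receive gradients.
    pred = model(noisy, t)                      # (B, N, 3*p*p)
    loss = F.mse_loss(pred, patchify(noise, p))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice the optimizer would be built over only the unfrozen parameters, e.g. `torch.optim.AdamW((q for q in model.parameters() if q.requires_grad), lr=1e-4)`, which keeps optimizer and gradient memory small and is consistent with the modest 8-GPU training budget the summary reports.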