Latent-to-Pixel Transfer Paradigm for Efficient Pixel Diffusion Models
Latent-to-Pixel (L2P) is a transfer paradigm for efficiently training pixel-space diffusion models by reusing a pre-trained latent diffusion model (LDM). L2P discards the VAE, tokenizes raw pixels with large patches, freezes the intermediate layers of the source LDM, and trains only the shallow layers to learn the latent-to-pixel transformation. The training corpus consists solely of synthetic images sampled from the source LDM, so no real data is required and convergence is rapid. Training fits on just 8 GPUs, and removing the VAE eliminates its memory bottleneck, enabling native 4K ultra-high-resolution generation. The approach is detailed in a paper on arXiv (2605.12013).
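To make the mechanism concrete, below is a minimal PyTorch sketch of the architectural surgery the paradigm describes. All names here (an `ldm_backbone` exposing a `.blocks` module list, `pixel_patch`, `num_shallow`) are illustrative assumptions, not identifiers from the paper.

```python
import torch
import torch.nn as nn

class L2PPixelDiffusion(nn.Module):
    """Hypothetical sketch: graft a pixel-space front/back end onto a
    pre-trained LDM transformer, freezing the intermediate trunk."""

    def __init__(self, ldm_backbone: nn.Module, hidden_dim: int = 1152,
                 pixel_patch: int = 16, num_shallow: int = 2):
        super().__init__()
        # Large-patch tokenizer replaces the VAE encoder: raw RGB pixels
        # are embedded directly, e.g. one 16x16 pixel patch -> one token.
        self.patch_embed = nn.Conv2d(3, hidden_dim,
                                     kernel_size=pixel_patch,
                                     stride=pixel_patch)
        # Reuse the transformer blocks of the source LDM (assumed to be
        # an nn.ModuleList on the backbone).
        self.blocks = ldm_backbone.blocks
        # New output head predicts pixel patches instead of VAE latents.
        self.head = nn.Linear(hidden_dim, 3 * pixel_patch * pixel_patch)

        # Freeze the intermediate trunk; keep only the shallow blocks at
        # both ends trainable (one reading of "shallow layers").
        for i, block in enumerate(self.blocks):
            trainable = (i < num_shallow
                         or i >= len(self.blocks) - num_shallow)
            block.requires_grad_(trainable)

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, D)
        for block in self.blocks:
            tokens = block(tokens, t)  # assumed DiT-style block signature
        return self.head(tokens)      # (B, N, 3 * patch * patch)
```

The design point this illustrates is that the new pixel-facing modules (`patch_embed`, `head`) and a few shallow blocks are the only trainable parameters; the frozen intermediate trunk carries over the generative prior of the source LDM.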
Key facts
- L2P stands for Latent-to-Pixel transfer paradigm
- It uses pre-trained latent diffusion models (LDMs) as a source
- The VAE is discarded in favor of large-patch tokenization
- Only shallow layers are trained; intermediate layers are frozen
- Training uses synthetic images sampled from the LDM itself, with no real data needed (see the training-loop sketch after this list)
- Requires only 8 GPUs for training
- Enables native 4K ultra-high-resolution generation
- Paper available on arXiv with ID 2605.12013
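As referenced in the list above, here is a minimal sketch of one synthetic-data training step, assuming the `L2PPixelDiffusion` model from the earlier sketch and a callable `ldm_pipeline` that samples a batch of images from the source LDM. The linear noise schedule and plain noise-prediction loss are simplifications, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def patchify(x: torch.Tensor, p: int = 16) -> torch.Tensor:
    """Rearrange (B, 3, H, W) pixels into (B, N, 3*p*p) patch tokens,
    matching the layout of the model's output head."""
    b, c, h, w = x.shape
    x = x.unfold(2, p, p).unfold(3, p, p)      # (B, C, H/p, W/p, p, p)
    x = x.permute(0, 2, 3, 1, 4, 5)
    return x.reshape(b, (h // p) * (w // p), c * p * p)

def train_step(model, ldm_pipeline, prompts, optimizer, p: int = 16):
    # 1. The source LDM is the sole data source: sample a synthetic batch.
    with torch.no_grad():
        images = ldm_pipeline(prompts)          # (B, 3, H, W) in [-1, 1]

    # 2. Diffuse the raw pixels directly -- there is no VAE encode step.
    noise = torch.randn_like(images)
    t = torch.rand(images.shape[0], device=images.device)  # t in [0, 1)
    a = (1.0 - t).view(-1, 1, 1, 1)             # toy linear schedule
    noisy = a.sqrt() * images + (1.0 - a).sqrt() * noise

    # 3. Only the shallow, unfrozen layers receive gradients.
    pred = model(noisy, t)                      # (B, N, 3*p*p)
    loss = F.mse_loss(pred, patchify(noise, p))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice the optimizer would be built over only the unfrozen parameters, e.g. `torch.optim.AdamW((q for q in model.parameters() if q.requires_grad), lr=1e-4)`, which keeps optimizer and gradient memory small and is consistent with the modest 8-GPU training budget the summary reports.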