Prologue Method Bridges Reconstruction-Generation Gap in AR Image Generation
Researchers have introduced Prologue, a method for autoregressive (AR) image generation that decouples reconstruction from generation by prepending a short sequence of prologue tokens to the visual token sequence. The prologue tokens are trained solely with the AR cross-entropy loss, while the visual tokens remain dedicated to reconstruction. On ImageNet 256x256, Prologue-Base lowers gFID from 21.01 to 10.75 without classifier-free guidance while keeping reconstruction quality essentially unchanged. Prologue-Large reaches an rFID of 0.99 and a gFID of 1.46 using a standard AR model with no auxiliary semantic supervision. The approach is formalized from an ELBO perspective.
Key facts
- Prologue is proposed to bridge the reconstruction-generation gap in autoregressive image generation.
- Prologue generates a small set of prologue tokens prepended to the visual token sequence.
- Prologue tokens are trained exclusively with AR cross-entropy loss.
- Visual tokens remain dedicated to reconstruction.
- On ImageNet 256x256, Prologue-Base reduces gFID from 21.01 to 10.75 without classifier-free guidance.
- Prologue-Large achieves rFID of 0.99 and gFID of 1.46 using a standard AR model.
- The approach is formalized from an ELBO perspective.
- No auxiliary semantic supervision is used for Prologue-Large.
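The loss split described in the bullets above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the sizes, the random stand-in logits, and the helper names are all assumptions, and the visual tokens' reconstruction objective is only noted in a comment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen for illustration only.
num_prologue = 4   # short prologue prepended to the sequence
num_visual = 12    # visual tokens from the image tokenizer
vocab = 16         # codebook size

# Token sequence: prologue tokens first, then visual tokens.
prologue_ids = rng.integers(0, vocab, size=num_prologue)
visual_ids = rng.integers(0, vocab, size=num_visual)
sequence = np.concatenate([prologue_ids, visual_ids])

# Stand-in for the AR model's next-token logits at every position.
logits = rng.normal(size=(len(sequence), vocab))

def cross_entropy(logits, targets):
    """Per-position cross-entropy (log-softmax + negative log-likelihood)."""
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets]

# Restrict the AR cross-entropy loss to the prologue positions,
# mirroring the split: prologue tokens carry the generation loss,
# while visual tokens are trained by a reconstruction objective
# (not shown here).
per_token = cross_entropy(logits, sequence)
prologue_mask = np.arange(len(sequence)) < num_prologue
ar_loss = per_token[prologue_mask].mean()
```

The point of the mask is that gradients from the generation objective reach only the prologue positions, so the visual tokens are free to optimize reconstruction.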