ViTok-v2: 5B Parameter Image Tokenizer with Native Resolution Support
ViTok-v2 is a Vision Transformer autoencoder with 5 billion parameters that supports native-resolution inputs and trains stably without adversarial losses. It builds on ViTok (Hansen-Estruch et al., 2025), which identified a trade-off between reconstruction and generation quality governed by the compression ratio r. ViTok-v2 adopts NaFlex for better generalization across resolutions and aspect ratios, and replaces the LPIPS and GAN objectives with a DINOv3 perceptual loss. Trained on roughly 2 billion images, it is the largest image autoencoder to date. The work addresses the shortcomings of earlier ViT tokenizers, which degrade beyond their training resolutions and depend on adversarial losses for stability. The paper is on arXiv under reference 2605.05331.
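NaFlex-style preprocessing keeps an image at its native aspect ratio and only downscales when the resulting patch count would exceed the model's sequence budget. The sketch below is a hypothetical illustration of that idea, not the actual NaFlex or ViTok-v2 code; the function name, patch size, and token budget are all assumptions for the example.

```python
import math

def naflex_grid(height, width, patch=16, max_tokens=1024):
    """Hypothetical sketch of NaFlex-style patchification.

    The image keeps its native aspect ratio; both sides are shrunk
    uniformly only if the patch count would exceed the token budget.
    All parameter values here are illustrative assumptions.
    """
    gh = math.ceil(height / patch)   # patch rows at native resolution
    gw = math.ceil(width / patch)    # patch columns at native resolution
    tokens = gh * gw
    if tokens > max_tokens:
        # shrink both grid dimensions by the same factor to fit the budget
        scale = math.sqrt(max_tokens / tokens)
        gh = max(1, math.floor(gh * scale))
        gw = max(1, math.floor(gw * scale))
    return gh, gw, gh * gw

# A 256x768 image keeps its 1:3 grid; a 1024x1024 image is shrunk to fit.
print(naflex_grid(256, 768))    # (16, 48, 768)
print(naflex_grid(1024, 1024))  # (32, 32, 1024)
```

The key property this illustrates is that, unlike fixed-resolution resizing, differently shaped images map to differently shaped token grids, which is what lets a single model generalize across resolutions and aspect ratios.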
Key facts
- ViTok-v2 is a Vision Transformer autoencoder with 5 billion parameters.
- It supports native resolution and aspect ratio generalization via NaFlex.
- A novel DINOv3 perceptual loss replaces LPIPS and GAN objectives.
- Trained on about 2 billion images.
- It is the largest image autoencoder to date.
- Builds on ViTok (Hansen-Estruch et al., 2025).
- Addresses performance degradation outside training resolutions.
- Enables stable scaling without adversarial losses.
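A perceptual loss of the kind described above compares reconstruction and target in the feature space of a frozen pretrained encoder (here DINOv3) instead of pixel space, removing the need for a GAN discriminator. The following is a minimal pure-Python sketch of that principle; the toy feature lists stand in for frozen DINOv3 activations, and none of this reflects the paper's actual implementation.

```python
def feature_mse(feats_recon, feats_target):
    """Mean squared error between two sets of feature maps.

    Each argument is a list of per-layer features, here represented as
    flat lists of floats standing in for frozen-encoder activations.
    """
    total, count = 0.0, 0
    for layer_a, layer_b in zip(feats_recon, feats_target):
        for a, b in zip(layer_a, layer_b):
            total += (a - b) ** 2
            count += 1
    return total / count

# Toy example: two "layers" of activations for reconstruction vs. target.
loss = feature_mse([[1.0, 2.0], [0.5]], [[1.0, 4.0], [0.5]])
print(loss)  # 1.3333... = (0 + 4 + 0) / 3
```

In the real setting the frozen encoder's weights receive no gradients; only the autoencoder is updated to pull its reconstructions closer to the target in feature space.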