Wasserstein Gradient Flow Improves Discrete Image Tokenizer Training
A recent paper on arXiv (2605.06148) presents an approach for aligning discrete image tokenizers with autoregressive (AR) priors throughout training. Conventional two-stage training first optimizes the tokenizer for reconstruction and then fits a prior model to the resulting static token sequences. This decoupling often produces a mismatch: the tokens preserve image detail but are hard for AR models to predict. The authors analyze the issue through Tripartite Variational Consistency (TVC), which decomposes latent-variable learning into three conditions: conditional-likelihood, prior, and posterior consistency. They show that two-stage training guarantees only the first condition and neglects prior consistency. To close this gap, they inject a distribution-level prior-matching signal via Wasserstein gradient flow during tokenizer training, so that the tokenizer produces tokens that are both reconstructive and easy to predict. On standard image generation benchmarks, the method improves AR prior fitting and generation quality.
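The summary does not spell out the three TVC conditions formally. One standard latent-variable formalization consistent with the description, in notation of our own choosing (tokenizer/encoder q_φ, decoder p_θ, AR prior p_ψ; the paper's exact definitions may differ), is:

```latex
% Hypothetical formalization of the three TVC conditions; the notation
% (q_\phi, p_\theta, p_\psi) is ours, not necessarily the paper's.
\begin{align}
  \text{(i) conditional-likelihood consistency:}\quad
    & p_\theta(x \mid z) \approx p_{\mathrm{data}}(x \mid z), \\
  \text{(ii) prior consistency:}\quad
    & q_\phi(z) = \mathbb{E}_{p_{\mathrm{data}}(x)}\!\bigl[q_\phi(z \mid x)\bigr]
      \approx p_\psi(z), \\
  \text{(iii) posterior consistency:}\quad
    & q_\phi(z \mid x) \approx p_{\theta,\psi}(z \mid x).
\end{align}
```

Under this reading, the reconstruction stage drives (i), while (ii), the match between the aggregate token distribution and the AR prior, is left entirely to the separately trained prior model, which is exactly the gap the paper targets.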
Key facts
- Paper arXiv:2605.06148 proposes Wasserstein gradient flow for discrete image tokenizer training.
- Traditional two-stage tokenizer training decouples reconstruction and prior fitting, causing a mismatch.
- Tripartite Variational Consistency (TVC) framework identifies three consistency conditions.
- Two-stage training only satisfies conditional-likelihood consistency, not prior consistency.
- The new method adds distribution-level prior-matching during tokenizer training.
- Wasserstein gradient flow is used to align token distributions with AR priors (see the sketch after this list).
- The approach improves AR prior fitting and generation quality on standard benchmarks.
- The paper is cross-listed between the computer vision and machine learning categories on arXiv.
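As referenced above, here is a minimal, self-contained sketch of a Wasserstein gradient flow, using the classical fact that the flow of KL(ρ‖π) in Wasserstein space is realized by Langevin dynamics. The Gaussian target π, the particle initialization, and the continuous state space are all toy assumptions for illustration; the paper works with discrete tokens and an AR prior, and its actual flow construction is not given in the summary.

```python
# Toy illustration (not the paper's algorithm): the Wasserstein gradient
# flow of F(rho) = KL(rho || pi) is realized by Langevin dynamics
#   dx = grad log pi(x) dt + sqrt(2 dt) dW.
# Here pi is a standard 2-D Gaussian standing in for a prior, and the
# particles mimic a mismatched aggregate token distribution q_phi(z).
import numpy as np

rng = np.random.default_rng(0)

def grad_log_pi(x):
    """Score of a standard Gaussian target: grad log pi(x) = -x."""
    return -x

# Particles initialized far from the target (a deliberate mismatch).
particles = rng.normal(loc=5.0, scale=0.5, size=(1024, 2))

dt, n_steps = 0.05, 200
for _ in range(n_steps):
    noise = rng.normal(size=particles.shape)
    # Euler-Maruyama step of the Langevin SDE: drift toward the prior's
    # high-density region plus diffusion.
    particles += dt * grad_log_pi(particles) + np.sqrt(2 * dt) * noise

# After the flow, the empirical mean/cov approach the prior's (0, I).
print("mean ~", particles.mean(axis=0))
print("cov  ~", np.cov(particles.T))
```

The design point this illustrates: a gradient flow supplies a distribution-level training signal, moving the whole empirical distribution toward the prior rather than penalizing individual samples, which is how the summary characterizes the paper's prior-matching term.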
Entities
Institutions
- arXiv