Variable Codebook Size Quantization for Autoregressive Visual Generation
A new arXiv paper (arXiv:2605.06207) identifies a fundamental limitation in discrete visual tokenizers that use a constant codebook size across all sequence positions. The authors observe that on ImageNet with K=16384, the per-position conditional entropy drops so rapidly that after just 2 of 256 positions the distribution becomes nearly deterministic, turning the remaining 254 positions into a memorization problem. They formalize this as the "Entropy Cliff", with the cliff position given by t* = ceil(log2 N / log2 K). Notably, this phenomenon does not occur in language, where natural structure keeps the effective entropy per position below codebook capacity. To address this, the paper proposes Variable Codebook Size Quantization, which adapts the codebook size per position to match the available entropy.
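To make the arithmetic behind the cliff position concrete, here is a minimal calculation, assuming N denotes the number of training images (about 1.28 million for ImageNet-1k) and K the codebook size; the function name is illustrative and not taken from the paper.

```python
import math

def entropy_cliff_position(num_samples: int, codebook_size: int) -> int:
    # t* = ceil(log2 N / log2 K): position after which, with a constant
    # codebook of size K, the per-position conditional entropy collapses.
    return math.ceil(math.log2(num_samples) / math.log2(codebook_size))

# ImageNet-1k training set (~1.28M images) with K = 16384:
# log2(1_281_167) ~= 20.3, log2(16384) = 14, ceil(20.3 / 14) = 2,
# consistent with the "2 of 256 positions" figure in the summary.
print(entropy_cliff_position(1_281_167, 16_384))  # -> 2
```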
Key facts
- Paper ID: arXiv:2605.06207
- Announce type: cross
- Constant-codebook design hits information-theoretic limit
- Per-position conditional entropy decays quickly along sequence
- On ImageNet with K=16384, entropy cliff occurs within 2 out of 256 positions
- Remaining 254 positions become a memorization problem
- Formalized as t* = ceil(log2 N / log2 K)
- Phenomenon not observed in language
- Proposed solution: Variable Codebook Size Quantization (illustrative sketch after this list)
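As a rough illustration of the core idea of matching codebook size to the entropy available at each position, the sketch below gives every sequence position its own codebook. The class name, the shrinking-size schedule, and the nearest-neighbor lookup are assumptions made for illustration only, not the paper's implementation.

```python
import torch
import torch.nn as nn

class PerPositionQuantizer(nn.Module):
    """Sketch: nearest-neighbor quantization where position t owns a codebook
    of size K_t, so later (lower-entropy) positions can use smaller codebooks."""

    def __init__(self, codebook_sizes: list[int], dim: int):
        super().__init__()
        # One embedding table per sequence position.
        self.codebooks = nn.ModuleList(
            nn.Embedding(k, dim) for k in codebook_sizes
        )

    def forward(self, z: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # z: (batch, seq_len, dim) continuous latents; seq_len == len(self.codebooks)
        quantized, indices = [], []
        for t, codebook in enumerate(self.codebooks):
            dists = torch.cdist(z[:, t], codebook.weight)  # (batch, K_t)
            idx = dists.argmin(dim=-1)                     # (batch,)
            quantized.append(codebook(idx))
            indices.append(idx)
        return torch.stack(quantized, dim=1), torch.stack(indices, dim=1)

# Hypothetical schedule: codebook sizes shrink along the 256-position sequence.
sizes = [max(2, 16_384 // (2 ** t)) for t in range(256)]
vq = PerPositionQuantizer(sizes, dim=8)
codes, ids = vq(torch.randn(4, 256, 8))
```

The halving schedule above is only a placeholder; the point is that per-position codebook capacity can be made to track the (rapidly decaying) conditional entropy rather than staying fixed at K.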