Variable Codebook Size Quantization for Autoregressive Visual Generation
A new arXiv paper (arXiv:2605.06207) identifies a fundamental limitation in discrete visual tokenizers that use a constant codebook size across all sequence positions. The authors observe that on ImageNet with K=16384, the per-position conditional entropy drops so rapidly that after just 2 of 256 positions the distribution becomes nearly deterministic, turning the remaining 254 positions into a memorization problem. They formalize this as the "Entropy Cliff", with the cliff position given by t* = ceil(log2 N / log2 K). Notably, this phenomenon does not occur in language, where natural structure keeps the effective entropy per position below codebook capacity. To address this, the paper proposes Variable Codebook Size Quantization, which adapts the codebook size per position to match the available entropy.
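To make the arithmetic behind the cliff position concrete, here is a minimal calculation, assuming N denotes the number of training images (about 1.28 million for ImageNet-1k) and K the codebook size; the function name is illustrative and not taken from the paper.

```python
import math

def entropy_cliff_position(num_samples: int, codebook_size: int) -> int:
    # t* = ceil(log2 N / log2 K): position after which, with a constant
    # codebook of size K, the per-position conditional entropy collapses.
    return math.ceil(math.log2(num_samples) / math.log2(codebook_size))

# ImageNet-1k training set (~1.28M images) with K = 16384:
# log2(1_281_167) ~= 20.3, log2(16384) = 14, ceil(20.3 / 14) = 2,
# consistent with the "2 of 256 positions" figure in the summary.
print(entropy_cliff_position(1_281_167, 16_384))  # -> 2
```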
Key facts
- Paper ID: arXiv:2605.06207
- Announce type: cross
- Constant-codebook design hits information-theoretic limit
- Per-position conditional entropy decays quickly along sequence
- On ImageNet with K=16384, entropy cliff occurs within 2 out of 256 positions
- Remaining 254 positions become a memorization problem
- Formalized as t* = ceil(log2 N / log2 K)
- Phenomenon not observed in language
- Proposed solution: Variable Codebook Size Quantization (illustrative sketch after this list)
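As a rough illustration of the core idea of matching codebook size to the entropy available at each position, the sketch below gives every sequence position its own codebook. The class name, the shrinking-size schedule, and the nearest-neighbor lookup are assumptions made for illustration only, not the paper's implementation.

```python
import torch
import torch.nn as nn

class PerPositionQuantizer(nn.Module):
    """Sketch: nearest-neighbor quantization where position t owns a codebook
    of size K_t, so later (lower-entropy) positions can use smaller codebooks."""

    def __init__(self, codebook_sizes: list[int], dim: int):
        super().__init__()
        # One embedding table per sequence position.
        self.codebooks = nn.ModuleList(
            nn.Embedding(k, dim) for k in codebook_sizes
        )

    def forward(self, z: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # z: (batch, seq_len, dim) continuous latents; seq_len == len(self.codebooks)
        quantized, indices = [], []
        for t, codebook in enumerate(self.codebooks):
            dists = torch.cdist(z[:, t], codebook.weight)  # (batch, K_t)
            idx = dists.argmin(dim=-1)                     # (batch,)
            quantized.append(codebook(idx))
            indices.append(idx)
        return torch.stack(quantized, dim=1), torch.stack(indices, dim=1)

# Hypothetical schedule: codebook sizes shrink along the 256-position sequence.
sizes = [max(2, 16_384 // (2 ** t)) for t in range(256)]
vq = PerPositionQuantizer(sizes, dim=8)
codes, ids = vq(torch.randn(4, 256, 8))
```

The halving schedule above is only a placeholder; the point is that per-position codebook capacity can be made to track the (rapidly decaying) conditional entropy rather than staying fixed at K.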