Chatterbox-Flash: Zero-Shot TTS with Block Diffusion

ai-technology · 2026-06-01

Chatterbox-Flash is a zero-shot text-to-speech (TTS) model that fine-tunes a pretrained autoregressive TTS decoder into a block-diffusion decoder, which generates tokens in parallel within each block while retaining block-by-block streaming. The researchers found that naive block-diffusion decoding on discrete speech tokens degrades quality because of a long-tail token distribution that favors high-frequency tokens. To address this without modifying the architecture, they propose prior-calibrated scoring, which subtracts the block-level marginal token distribution, and an early-decoding schedule, which adaptively ends iterations based on calibrated confidence. On standard zero-shot TTS benchmarks, Chatterbox-Flash delivers high-fidelity synthesis comparable to strong autoregressive and non-autoregressive models while also supporting streaming inference. The paper is available on arXiv as 2605.30748.
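
To make the idea of prior-calibrated scoring concrete, here is a minimal toy sketch in Python. It is not the paper's implementation. It assumes the subtraction happens in log space and that the block-level marginal is estimated by averaging the predicted distributions over the positions of the block. The function names, the `eps` smoothing term, and the three-token vocabulary are all hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def block_marginal(log_probs_per_pos):
    """Assumed estimator: mean predicted probability of each token over the block."""
    n = len(log_probs_per_pos)
    vocab = len(log_probs_per_pos[0])
    return [sum(math.exp(lp[v]) for lp in log_probs_per_pos) / n for v in range(vocab)]


def prior_calibrated_scores(logits_per_pos, eps=1e-8):
    """Subtract the log block-level marginal from each position's log-probabilities."""
    log_probs = [log_softmax(row) for row in logits_per_pos]
    log_prior = [math.log(p + eps) for p in block_marginal(log_probs)]
    return [[lp[v] - log_prior[v] for v in range(len(lp))] for lp in log_probs]


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


# Toy block of 3 positions over a 3-token vocabulary.
# Token 0 plays the role of a high-frequency token that dominates every position.
logits = [
    [2.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [2.0, 1.9, 0.0],  # token 1 is nearly as likely here, but raw scoring still picks token 0
]
raw_pick = argmax(logits[2])
calibrated_pick = argmax(prior_calibrated_scores(logits)[2])
print("raw pick:", raw_pick, "calibrated pick:", calibrated_pick)
```

In this toy block, token 0 wins everywhere under raw scoring. Once its block-wide prevalence is subtracted, the last position prefers token 1, which illustrates how calibration can counteract a bias toward frequent tokens.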

Key facts

Chatterbox-Flash is a zero-shot TTS model.
It fine-tunes a pretrained autoregressive TTS decoder into a block-diffusion decoder.
Enables parallel token generation within blocks.
Retains block-by-block streaming capability.
Naive block-diffusion decoding degrades quality due to a long-tail token distribution.
Introduces prior-calibrated scoring and an early-decoding schedule.
Achieves high-fidelity synthesis comparable to strong baselines.
Paper available on arXiv (2605.30748).
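
The early-decoding schedule named in the list above can be illustrated with a small toy loop in Python. This is only a sketch under stated assumptions, since the summary does not give the actual stopping rule. It assumes a hypothetical `score_fn` that returns a `(token, confidence)` proposal for every position, a fixed confidence `threshold`, and a fallback that commits the single most confident position when none clears the threshold. Decoding of a block ends as soon as every position is committed, so easy blocks use fewer iterations than `max_iters`.

```python
def decode_block(score_fn, block_len, max_iters=8, threshold=0.5):
    """Iteratively fill one block; stop early once all positions are committed.

    score_fn(tokens) is a hypothetical stand-in for the model: it returns one
    (token, confidence) pair per position, given the partially filled block.
    """
    tokens = [None] * block_len  # None marks a still-masked position
    for iteration in range(1, max_iters + 1):
        proposals = score_fn(tokens)
        masked = [i for i, t in enumerate(tokens) if t is None]
        confident = [i for i in masked if proposals[i][1] >= threshold]
        if iteration == max_iters:
            commit = masked  # out of budget: commit everything that is left
        elif confident:
            commit = confident  # commit all confident positions in parallel
        else:
            commit = [max(masked, key=lambda i: proposals[i][1])]  # assumed fallback
        for i in commit:
            tokens[i] = proposals[i][0]
        if all(t is not None for t in tokens):
            return tokens, iteration  # early exit
    return tokens, max_iters


def toy_score_fn(tokens):
    """Toy model: confidence rises as more of the block is already filled in."""
    filled = sum(t is not None for t in tokens)
    return [
        (10 + i, 0.3 + 0.2 * filled + (0.3 if i % 2 == 0 else 0.0))
        for i in range(len(tokens))
    ]


block_tokens, iters_used = decode_block(toy_score_fn, block_len=4)
print(block_tokens, "decoded in", iters_used, "iterations")
```

A streaming decoder would call a routine like `decode_block` once per block and emit each finished block before starting the next, which is how parallel generation within a block and block-by-block streaming can coexist.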
