Chatterbox-Flash: Zero-Shot TTS with Block Diffusion
Chatterbox-Flash is a zero-shot text-to-speech (TTS) model that fine-tunes a pretrained autoregressive TTS decoder into a block-diffusion decoder. Tokens within a block are generated in parallel, while blocks are still produced one after another, so the model retains streaming capability. The authors find that naive block-diffusion decoding on discrete speech tokens degrades quality, because the long-tail token distribution biases decoding toward high-frequency tokens. To address this without modifying the architecture, they propose two techniques: prior-calibrated scoring, which subtracts the block-level marginal token distribution, and an early-decoding schedule, which adaptively ends iterations based on calibrated confidence. On standard zero-shot TTS benchmarks, Chatterbox-Flash delivers high-fidelity synthesis comparable to strong autoregressive and non-autoregressive models while still supporting streaming inference. The paper is available on arXiv as 2605.30748.
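To make the calibration idea concrete, here is a minimal sketch, not the paper's implementation. It assumes the block-level marginal is estimated as the mean of the model's predicted distributions across the positions of one block, and that the subtraction happens in log space; the summary does not specify either detail. The function names and the `alpha` strength parameter are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def block_log_prior(block_log_probs):
    """Log of the mean predicted distribution over all positions in a block.

    Assumption: this mean stands in for the "block-level marginal".
    """
    n = len(block_log_probs)
    vocab = len(block_log_probs[0])
    return [
        math.log(sum(math.exp(row[v]) for row in block_log_probs) / n)
        for v in range(vocab)
    ]

def calibrated_scores(block_logits, alpha=1.0):
    """Subtract the (scaled) block log-prior from each position's log-probs.

    alpha is a hypothetical knob; alpha=0 recovers the uncalibrated scores.
    """
    log_probs = [log_softmax(row) for row in block_logits]
    prior = block_log_prior(log_probs)
    return [[lp - alpha * pr for lp, pr in zip(row, prior)] for row in log_probs]

def argmax(row):
    return max(range(len(row)), key=row.__getitem__)

# Toy block: 3 positions, vocabulary of 3 tokens. Token 0 acts as a
# high-frequency token that receives a large logit everywhere.
block_logits = [
    [3.0, 0.0, 0.0],
    [2.0, 0.0, 1.8],
    [3.0, 0.2, 0.0],
]

raw_picks = [argmax(row) for row in block_logits]
calibrated_picks = [argmax(row) for row in calibrated_scores(block_logits)]
print(raw_picks, calibrated_picks)
```

In this toy block, token 0 wins the raw argmax at every position. After the block prior is subtracted, position 1 switches to token 2, the token it favors much more strongly than the rest of the block does, while positions 0 and 2 keep token 0.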
Key facts
- Chatterbox-Flash is a zero-shot TTS model.
- It fine-tunes a pretrained autoregressive TTS decoder into a block-diffusion decoder.
- Enables parallel token generation within blocks.
- Retains block-by-block streaming capability.
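- Naive block-diffusion decoding degrades quality because of a long-tail token distribution that favors high-frequency tokens.
- Introduces prior-calibrated scoring and an early-decoding schedule, with no change to the architecture.
- Achieves high-fidelity synthesis comparable to strong autoregressive and non-autoregressive models on standard zero-shot TTS benchmarks.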
- Paper available on arXiv (2605.30748).
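The combination of block-by-block streaming, parallel decoding within a block, and confidence-based early stopping listed above can be sketched as follows. This is an illustrative loop under stated assumptions, not the authors' schedule: the `MASK` sentinel, the `score_fn` interface, the fixed `threshold`, and the toy scorer are all hypothetical, and the scores are assumed to be already calibrated.

```python
MASK = -1  # hypothetical sentinel for a position that is not decoded yet

def decode_block(score_fn, block_len, max_steps=8, threshold=0.5):
    """Fill one block iteratively and stop once every position is committed.

    score_fn(tokens) returns, per position, a list of calibrated scores
    over the vocabulary (assumed interface). Returns (tokens, steps_used).
    """
    tokens = [MASK] * block_len
    steps = 0
    while MASK in tokens:
        steps += 1
        scores = score_fn(tokens)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        best = {i: max(range(len(scores[i])), key=scores[i].__getitem__)
                for i in masked}
        conf = {i: scores[i][best[i]] for i in masked}
        # Commit all confident positions in parallel; on the final allowed
        # step, commit whatever is left.
        last = steps >= max_steps
        chosen = [i for i in masked if last or conf[i] >= threshold]
        if not chosen:
            # Guarantee progress: commit the single most confident position.
            chosen = [max(masked, key=conf.get)]
        for i in chosen:
            tokens[i] = best[i]
    return tokens, steps

def stream_blocks(score_fn_for_block, num_blocks, block_len, **kwargs):
    """Yield each finished block immediately so playback could start early."""
    history = []
    for _ in range(num_blocks):
        tokens, steps = decode_block(
            lambda cur: score_fn_for_block(history, cur), block_len, **kwargs
        )
        history.extend(tokens)
        yield tokens, steps

def toy_score_fn(history, current):
    """Stand-in for a model: confidence grows as more of the block is known."""
    committed = sum(t != MASK for t in current)
    rows = []
    for i in range(len(current)):
        target = (len(history) + i) % 5
        row = [0.0] * 5
        row[target] = 0.3 + 0.3 * committed + (0.3 if i == 0 else 0.0)
        rows.append(row)
    return rows

for tokens, steps in stream_blocks(toy_score_fn, num_blocks=2, block_len=4):
    print(tokens, "decoded in", steps, "steps")
```

With the toy scorer, each 4-token block finishes in 2 iterations instead of the 8 allowed: the first step commits only the easy first position, and the second step commits the remaining three in parallel once their confidence crosses the threshold.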
Entities
Institutions
- arXiv