ARTFEED — Contemporary Art Intelligence

Chatterbox-Flash: Zero-Shot TTS with Block Diffusion

ai-technology · 2026-06-01

Chatterbox-Flash is a zero-shot text-to-speech model that fine-tunes a pretrained autoregressive TTS decoder into a block-diffusion decoder. The result generates tokens in parallel within each block while still streaming block by block. The researchers found that naive block-diffusion decoding on discrete speech tokens degrades quality, because the long-tail token distribution biases decoding toward high-frequency tokens. To address this without modifying the architecture, they propose two techniques: prior-calibrated scoring, which subtracts the block-level marginal token distribution, and an early-decoding schedule, which adaptively ends iterations based on calibrated confidence. On standard zero-shot TTS benchmarks, Chatterbox-Flash delivers high-fidelity synthesis comparable to strong autoregressive and non-autoregressive models, while also enabling streaming inference. The paper is available on arXiv under reference 2605.30748.
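To make the idea of prior-calibrated scoring concrete, here is a minimal sketch in Python. It is not the paper's implementation: the summary only says that the block-level marginal token distribution is subtracted, so the exact formula is an assumption. The sketch takes the marginal as the mean of the per-position distributions in a block and subtracts it in log space. The function names and the `alpha` and `eps` parameters are hypothetical.

```python
import math

def block_marginal(block_probs):
    """Average the per-position distributions of one block (assumed definition)."""
    n = len(block_probs)
    vocab = len(block_probs[0])
    return [sum(p[v] for p in block_probs) / n for v in range(vocab)]

def calibrated_scores(block_probs, alpha=1.0, eps=1e-9):
    """Score = log p(token | position) - alpha * log marginal(token).

    alpha and eps are illustrative knobs, not values from the paper.
    """
    marginal = block_marginal(block_probs)
    return [
        [math.log(p[v] + eps) - alpha * math.log(marginal[v] + eps)
         for v in range(len(p))]
        for p in block_probs
    ]

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

# Toy block of 3 positions over a 3-token vocabulary.
# Token 0 plays the role of a high-frequency token that dominates everywhere.
block = [
    [0.50, 0.45, 0.05],
    [0.60, 0.10, 0.30],
    [0.70, 0.20, 0.10],
]

raw_picks = [argmax(p) for p in block]
calibrated_picks = [argmax(s) for s in calibrated_scores(block)]
print(raw_picks)         # [0, 0, 0]
print(calibrated_picks)  # [1, 2, 0]
```

In this toy case the raw argmax picks the frequent token at every position, while the calibrated score picks the token that is unusually likely at that position relative to the rest of the block.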

Key facts

  • Chatterbox-Flash is a zero-shot TTS model.
  • It fine-tunes a pretrained autoregressive TTS decoder into a block-diffusion decoder.
  • Enables parallel token generation within blocks.
  • Retains block-by-block streaming capability.
  • Naive block-diffusion decoding degrades quality due to a long-tail token distribution.
  • Introduces prior-calibrated scoring and an early-decoding schedule.
  • Achieves high-fidelity synthesis comparable to strong baselines.
  • Paper available on arXiv (2605.30748).
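
The early-decoding schedule listed above can be illustrated with a small Python sketch. This is one plausible reading, not the authors' algorithm: a block starts fully masked, each iteration commits the most confident positions in parallel, and the loop stops early once every remaining position clears a confidence threshold. `decode_block`, `toy_scorer`, `threshold`, and `max_steps` are hypothetical names, and the scorer stands in for a real model that would return calibrated confidences.

```python
MASK = None

def decode_block(score_fn, block_size, max_steps=8, threshold=0.85):
    """Fill one block of masked tokens; return (tokens, steps_used)."""
    tokens = [MASK] * block_size
    per_step = max(1, -(-block_size // max_steps))  # ceiling division
    for step in range(1, max_steps + 1):
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        if not masked:
            return tokens, step - 1
        # score_fn returns one (token, confidence) pair per position.
        preds = score_fn(tokens)
        # Early exit: every remaining position is already confident enough.
        if all(preds[i][1] >= threshold for i in masked):
            for i in masked:
                tokens[i] = preds[i][0]
            return tokens, step
        # Otherwise commit only the most confident positions this round.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:per_step]:
            tokens[i] = preds[i][0]
    return tokens, max_steps

def toy_scorer(tokens):
    """Stand-in model: confidence grows as more of the block is committed."""
    filled = sum(t is not MASK for t in tokens)
    return [(100 + i, min(1.0, 0.5 + 0.2 * filled + 0.01 * i))
            for i in range(len(tokens))]

tokens, steps = decode_block(toy_scorer, block_size=8)
print(tokens)  # [100, 101, 102, 103, 104, 105, 106, 107]
print(steps)   # 3
```

With the toy scorer, the block of 8 tokens finishes in 3 iterations instead of the full budget of 8, because confidence passes the threshold after two positions have been committed.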

Entities

Institutions

  • arXiv

Sources