ARTFEED — Contemporary Art Intelligence

Microsoft Open-Sources VibeVoice Family of Voice AI Models

ai-technology · 2026-04-28

Microsoft has open-sourced VibeVoice, a family of frontier voice AI models covering both text-to-speech (TTS) and automatic speech recognition (ASR). VibeVoice-ASR processes up to 60 minutes of long-form audio in a single pass and produces structured transcriptions with speaker diarization, timestamps, and content; it supports over 50 languages and user-customized hotwords. VibeVoice-TTS synthesizes up to 90 minutes of speech with up to four distinct speakers and was accepted as an Oral at ICLR 2026. VibeVoice-Realtime is a lightweight 0.5B-parameter model for streaming TTS with roughly 300 ms latency. A core innovation is the use of continuous speech tokenizers operating at 7.5 Hz, paired with a next-token diffusion framework that uses a large language model. The VibeVoice-TTS code was removed from the repository after the discovery of misuse inconsistent with the project's stated intent. The models are integrated with Hugging Face Transformers and support vLLM inference. Microsoft emphasizes responsible use and warns against using the models for deepfakes and disinformation.
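To give a sense of why a 7.5 Hz tokenizer matters for hour-scale audio, the sketch below works out how many tokenizer frames the quoted durations would need. It assumes one frame per tokenizer step; the function name `frames_for` and the 50 Hz comparison rate are illustrative assumptions, not figures from Microsoft or the article.

```python
def frames_for(duration_minutes: float, frame_rate_hz: float) -> int:
    """Return the number of tokenizer frames needed to cover the audio."""
    return round(duration_minutes * 60 * frame_rate_hz)


VIBEVOICE_RATE_HZ = 7.5    # frame rate stated in the article
COMPARISON_RATE_HZ = 50.0  # assumption: illustrative higher rate, not from the article

for label, minutes in [("ASR, 60 min", 60), ("TTS, 90 min", 90)]:
    low = frames_for(minutes, VIBEVOICE_RATE_HZ)
    high = frames_for(minutes, COMPARISON_RATE_HZ)
    print(f"{label}: {low} frames at 7.5 Hz vs {high} frames at 50 Hz")
```

Under these assumptions, 90 minutes of speech comes to 40,500 frames at 7.5 Hz, a sequence length far easier for a language model to handle in one pass than the 270,000 frames the hypothetical 50 Hz rate would require.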

Key facts

  • VibeVoice is an open-source family of voice AI models from Microsoft.
  • VibeVoice-ASR handles up to 60 minutes of audio in a single pass and outputs structured transcriptions with speaker diarization and timestamps.
  • VibeVoice-ASR supports over 50 languages and customized hotwords.
  • VibeVoice-TTS synthesizes up to 90 minutes of speech with up to 4 speakers.
  • VibeVoice-TTS was accepted as an Oral at ICLR 2026.
  • VibeVoice-Realtime is a 0.5B-parameter streaming TTS model with ~300ms latency.
  • Core innovation: continuous speech tokenizers at 7.5 Hz with next-token diffusion.
  • The VibeVoice-TTS code was removed from the repository after misuse inconsistent with its intended use was discovered.

Entities

Institutions

  • Microsoft
  • Hugging Face
  • ICLR

Sources