X-Voice: Zero-Shot Voice Cloning in 30 Languages

ai-technology · 2026-05-09

Researchers have unveiled X-Voice, a 0.4-billion-parameter multilingual model that performs zero-shot voice cloning in 30 languages. It was trained on a 420,000-hour multilingual dataset and uses the International Phonetic Alphabet (IPA) as a unified text representation across languages. A two-stage training approach removes the need for transcripts of the prompt audio. In the first stage, X-Voice$_{\text{s1}}$ is trained with standard conditional flow matching and then used to synthesize 10,000 hours of audio prompts. In the second stage, the model is fine-tuned on the resulting audio pairs with the prompt text masked, yielding X-Voice$_{\text{s2}}$, which performs zero-shot voice cloning without a prompt transcript. The architecture extends F5-TTS with dual-level language identifier injection and with decoupling and scheduling of Classifier-Free Guidance (CFG). The work is described in a paper on arXiv (2605.05611).
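To make the first training stage more concrete, here is a minimal sketch of a conditional flow-matching objective. It assumes the straight-line (optimal-transport) path between noise and data that is common in flow-matching models; the summary does not state which path X-Voice uses. The names `cfm_training_pair`, `cfm_loss`, and `predict_velocity` are hypothetical, and plain Python lists stand in for acoustic feature frames.

```python
import random


def cfm_training_pair(x1, t, rng):
    """Build the noisy input x_t and the regression target for one example.

    x1  : list of floats, a clean feature frame (stand-in for real acoustic features)
    t   : flow time in [0, 1)
    rng : random.Random instance used to draw the noise sample x0
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]                 # Gaussian noise sample
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]   # point on the straight path
    target = [b - a for a, b in zip(x0, x1)]               # constant velocity x1 - x0
    return xt, target


def cfm_loss(predict_velocity, x1, cond, rng):
    """Mean squared error between predicted and target velocity.

    predict_velocity(xt, t, cond) is a placeholder for the network; `cond`
    stands for whatever conditioning it receives (text, prompt audio, language ID).
    """
    t = rng.random()
    xt, target = cfm_training_pair(x1, t, rng)
    pred = predict_velocity(xt, t, cond)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(x1)
```

In this sketch, a predictor that always returns the exact velocity `x1 - x0` would drive the loss to zero; a real model only sees `xt`, `t`, and the conditioning, and learns to approximate that velocity.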

Key facts

X-Voice is a 0.4B multilingual zero-shot voice cloning model.
It can speak 30 languages in a cloned voice.
Trained on a 420K-hour multilingual corpus.
Uses IPA as a unified representation.
Two-stage training paradigm eliminates reliance on prompt text transcripts.
Stage 1: X-Voice$_{\text{s1}}$ is trained with conditional flow matching and synthesizes 10K hours of audio prompts.
Stage 2: X-Voice$_{\text{s2}}$ is fine-tuned on the resulting audio pairs with the prompt text masked, enabling zero-shot cloning without transcripts.
Architecture extends F5-TTS with dual-level language identifier injection and CFG decoupling/scheduling.
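The summary does not say how X-Voice decouples or schedules Classifier-Free Guidance, so the following is only a generic sketch of what those two ideas can look like. `cfg` is the standard guidance formula; `linear_schedule` and `decoupled_cfg` are hypothetical illustrations (a time-varying weight, and separate weights for the text and language-identifier conditions) and are not taken from the paper.

```python
def cfg(v_cond, v_uncond, w):
    """Standard CFG: w = 0 gives the unconditional prediction, w = 1 the conditional one."""
    return [u + w * (c - u) for c, u in zip(v_cond, v_uncond)]


def linear_schedule(t, w_start, w_end):
    """Hypothetical schedule: interpolate the guidance weight over flow time t in [0, 1]."""
    return w_start + (w_end - w_start) * t


def decoupled_cfg(v_full, v_text_only, v_uncond, w_text, w_lang):
    """Hypothetical decoupling: one weight per condition.

    v_full      : prediction with text and language ID
    v_text_only : prediction with text but no language ID
    v_uncond    : prediction with neither condition
    """
    return [
        u + w_text * (vt - u) + w_lang * (vf - vt)
        for vf, vt, u in zip(v_full, v_text_only, v_uncond)
    ]
```

With `w_text = w_lang = 1`, `decoupled_cfg` reduces to the fully conditioned prediction; raising one weight strengthens only that condition's influence, which is the general motivation for decoupling guidance terms.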
