X-Voice: Zero-Shot Voice Cloning in 30 Languages
Researchers have introduced X-Voice, a 0.4-billion-parameter multilingual voice cloning model that can make a cloned voice speak 30 languages. The model is trained on a 420,000-hour multilingual corpus and uses the International Phonetic Alphabet (IPA) as a unified text representation across languages. A two-stage training paradigm removes the need for a transcript of the prompt audio. In the first stage, X-Voice$_{\text{s1}}$ is trained with standard conditional flow matching and is then used to synthesize 10,000 hours of audio segments that serve as prompts. In the second stage, the model is fine-tuned on the resulting audio pairs with the prompt text masked, yielding X-Voice$_{\text{s2}}$, which performs zero-shot voice cloning without a prompt transcript. Architecturally, X-Voice extends F5-TTS with dual-level language identifier injection and with decoupling and scheduling of Classifier-Free Guidance (CFG). The work is described in a paper on arXiv (2605.05611).
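For readers unfamiliar with the stage 1 objective, conditional flow matching in its commonly used linear-path form can be written as below. This is the textbook formulation, shown only as an illustration; the summary does not state which probability path or conditioning X-Voice actually uses. The symbols are notation introduced here, not taken from the paper: $x_0$ is Gaussian noise, $x_1$ is the target acoustic feature sequence (for example mel-spectrogram frames), $c$ is the conditioning (text and audio prompt), and $v_\theta$ is the learned velocity field.

```latex
x_t = (1 - t)\,x_0 + t\,x_1, \qquad x_0 \sim \mathcal{N}(0, I),\quad t \sim \mathcal{U}[0, 1]

\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}\,\bigl\| v_\theta(x_t, t, c) - (x_1 - x_0) \bigr\|^2
```

Under this assumed path, the regression target $x_1 - x_0$ is the constant velocity that carries a noise sample to the data sample along a straight line, so sampling amounts to integrating $v_\theta$ from $t = 0$ to $t = 1$.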
Key facts
- X-Voice is a 0.4B multilingual zero-shot voice cloning model.
- It supports voice cloning across 30 languages.
- Trained on a 420K-hour multilingual corpus.
- Uses IPA as a unified representation.
- Two-stage training paradigm eliminates reliance on prompt text transcripts.
- Stage 1: X-Voice$_{\text{s1}}$ is trained via conditional flow matching and used to synthesize 10K hours of audio prompts.
- Stage 2: X-Voice$_{\text{s2}}$ is fine-tuned on the resulting audio pairs with prompt text masked, enabling zero-shot cloning without a prompt transcript.
- Architecture extends F5-TTS with dual-level language identifier injection and CFG decoupling/scheduling.
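To make the CFG scheduling idea concrete, here is a minimal sketch of classifier-free guidance inside an Euler sampler for a flow-matching model. It is not X-Voice's implementation: the summary gives no details of the decoupling or the schedule, so `linear_schedule`, `toy_velocity`, and the `sample` signature are all hypothetical, and decoupling of separate guidance terms is not modeled. The combination rule used is one common CFG parameterization.

```python
from typing import Callable, List, Optional

Vector = List[float]
# Hypothetical interface: velocity(x, t, cond); cond=None means "unconditional".
VelocityFn = Callable[[Vector, float, Optional[Vector]], Vector]


def guided_velocity(v_cond: Vector, v_uncond: Vector, scale: float) -> Vector:
    """One common CFG form: v_uncond + scale * (v_cond - v_uncond)."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def linear_schedule(t: float, start: float, end: float) -> float:
    """Hypothetical schedule: guidance scale varies linearly over t in [0, 1]."""
    return start + (end - start) * t


def toy_velocity(x: Vector, t: float, cond: Optional[Vector]) -> Vector:
    """Stand-in for a trained model: pulls x toward cond, or toward 0 if cond is None."""
    target = cond if cond is not None else [0.0] * len(x)
    return [ti - xi for ti, xi in zip(target, x)]


def sample(velocity: VelocityFn, x0: Vector, cond: Vector,
           steps: int = 8, start: float = 2.0, end: float = 1.0) -> Vector:
    """Euler integration from t=0 to t=1 with a time-dependent guidance scale."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        scale = linear_schedule(t, start, end)
        v = guided_velocity(velocity(x, t, cond), velocity(x, t, None), scale)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    print(sample(toy_velocity, [0.0, 0.0], [1.0, -1.0]))
```

With `scale = 1.0` the guided velocity reduces to the conditional one; larger values push samples further toward the condition. A schedule lets that strength change over the sampling trajectory rather than stay fixed.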
Entities
Institutions
- arXiv