FMSD-TTS: Few-Shot Multi-Dialect Tibetan Speech Synthesis
Researchers have introduced FMSD-TTS, a few-shot, multi-speaker, multi-dialect text-to-speech framework for Tibetan that covers the language's three major dialects: U-Tsang, Amdo, and Kham. The system generates speech in a chosen dialect from a small amount of reference audio paired with an explicit dialect label. It combines a speaker-dialect fusion module with a Dialect-Specialized Dynamic Routing Network (DSDR-Net), which captures fine-grained acoustic and linguistic differences between dialects while preserving the speaker's identity. In the reported evaluations, FMSD-TTS outperforms baseline models in both dialectal expressiveness and speaker similarity. The quality of the synthesized speech is further validated on a challenging speech-to-speech dialect conversion task. The paper is available on arXiv under ID 2505.14351.
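To make the two ideas in the summary more concrete, here is a minimal sketch of how a speaker vector and a dialect label could be fused and then routed through a dialect-specific branch. It is an illustration under stated assumptions, not the paper's architecture: the class `ToyDialectRouter`, its dimensions, the concatenate-and-project fusion, and the hard routing by label are all hypothetical, and the weights are random rather than trained.

```python
import random

# Dialect names come from the summary; everything else here is hypothetical.
DIALECTS = ("U-Tsang", "Amdo", "Kham")


def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToyDialectRouter:
    """Toy stand-in for speaker-dialect fusion plus dialect-specific routing.

    Assumption: this is NOT DSDR-Net as defined in the paper, only a sketch
    of the general idea of label-conditioned branches with a shared path.
    """

    def __init__(self, speaker_dim=4, dialect_dim=2, hidden_dim=4, seed=0):
        rng = random.Random(seed)
        # One embedding vector per dialect label.
        self.dialect_emb = {
            d: [rng.uniform(-1.0, 1.0) for _ in range(dialect_dim)]
            for d in DIALECTS
        }
        # Fusion: project concat(speaker, dialect) down to the hidden size.
        self.fuse = make_matrix(hidden_dim, speaker_dim + dialect_dim, rng)
        # A shared branch plus one branch per dialect.
        self.shared = make_matrix(hidden_dim, hidden_dim, rng)
        self.experts = {
            d: make_matrix(hidden_dim, hidden_dim, rng) for d in DIALECTS
        }

    def condition(self, speaker_vec, dialect):
        """Fuse a speaker vector with the embedding of the dialect label."""
        if dialect not in self.dialect_emb:
            raise ValueError(f"unknown dialect: {dialect!r}")
        return matvec(self.fuse, speaker_vec + self.dialect_emb[dialect])

    def forward(self, frame, speaker_vec, dialect):
        """Condition one hidden frame, then add shared and routed outputs."""
        cond = self.condition(speaker_vec, dialect)
        hidden = [f + c for f, c in zip(frame, cond)]
        shared_out = matvec(self.shared, hidden)
        routed_out = matvec(self.experts[dialect], hidden)
        return [s + r for s, r in zip(shared_out, routed_out)]


model = ToyDialectRouter()
speaker = [0.2, -0.1, 0.4, 0.3]   # stand-in for a vector from reference audio
frame = [0.5, 0.1, -0.2, 0.0]     # stand-in for one text-side hidden frame

out_amdo = model.forward(frame, speaker, "Amdo")
out_kham = model.forward(frame, speaker, "Kham")
print(out_amdo)
print(out_kham)
```

The same speaker vector and input frame produce different outputs for "Amdo" and "Kham" because both the fused conditioning vector and the selected branch change with the label, while the shared branch stays common to all dialects. A real system would learn these weights and would likely use a richer routing scheme than a hard lookup by label.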
Key facts
- Tibetan is a low-resource language with minimal parallel speech corpora across its three major dialects: U-Tsang, Amdo, and Kham.
- FMSD-TTS is a few-shot, multi-speaker, multi-dialect text-to-speech framework.
- The framework synthesizes speech from limited reference audio and explicit dialect labels.
- It combines a speaker-dialect fusion module with a Dialect-Specialized Dynamic Routing Network (DSDR-Net).
- DSDR-Net captures fine-grained acoustic and linguistic variations across dialects while preserving speaker identity.
- FMSD-TTS significantly outperforms baselines in dialectal expressiveness and speaker similarity.
- Synthesized speech is validated through a speech-to-speech dialect conversion task.
- The paper is available on arXiv under ID 2505.14351.
Entities
Institutions
- arXiv