FMSD-TTS: Few-Shot Multi-Dialect Tibetan Speech Synthesis

other · 2026-04-27

Researchers have introduced FMSD-TTS, a few-shot text-to-speech system for Tibetan that covers its three major dialects: U-Tsang, Amdo, and Kham. The system generates speech in a target dialect from a small amount of reference audio paired with an explicit dialect label. It combines a speaker-dialect fusion module with a Dialect-Specialized Dynamic Routing Network (DSDR-Net), which captures fine-grained acoustic and linguistic differences between dialects while preserving the speaker's identity. In the reported evaluations, FMSD-TTS outperforms baseline models in both dialectal expressiveness and speaker similarity. The quality of the synthesized speech is further validated on a challenging speech-to-speech dialect conversion task. The paper is available on arXiv under ID 2505.14351.

Key facts

Tibetan is a low-resource language with minimal parallel speech corpora across its three major dialects: U-Tsang, Amdo, and Kham.
FMSD-TTS is a few-shot, multi-speaker, multi-dialect text-to-speech framework.
The framework synthesizes speech from limited reference audio and explicit dialect labels.
It features a speaker-dialect fusion module and a Dialect-Specialized Dynamic Routing Network (DSDR-Net).
DSDR-Net captures fine-grained acoustic and linguistic variations across dialects while preserving speaker identity.
FMSD-TTS significantly outperforms baselines in dialectal expressiveness and speaker similarity.
Synthesized speech is validated through a speech-to-speech dialect conversion task.
The paper is available on arXiv under ID 2505.14351.
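
As a rough illustration of the two components named above, the sketch below fuses a speaker embedding with a dialect embedding and then routes the hidden state through a dialect-specific transform. This is a toy in plain Python, not the architecture from the paper: the class name `ToyDialectRouter`, the concatenate-and-project fusion, the hard routing by dialect label, the residual connection, and all dimensions are assumptions made for illustration. The actual DSDR-Net may route and fuse quite differently.

```python
import random

DIALECTS = ("U-Tsang", "Amdo", "Kham")


def make_matrix(rows, cols, rng):
    """Small random weight matrix (stand-in for learned parameters)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToyDialectRouter:
    """Hypothetical sketch: speaker-dialect fusion plus label-based routing."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # One embedding per dialect label (assumed, randomly initialised).
        self.dialect_emb = {
            d: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for d in DIALECTS
        }
        # Fusion projection: [speaker ; dialect] (2*dim) -> dim.
        self.fuse = make_matrix(dim, 2 * dim, rng)
        # One transform per dialect; hard routing by label is an assumption.
        self.experts = {d: make_matrix(dim, dim, rng) for d in DIALECTS}

    def fuse_condition(self, speaker_emb, dialect):
        """Concatenate speaker and dialect embeddings, then project."""
        joined = list(speaker_emb) + self.dialect_emb[dialect]
        return matvec(self.fuse, joined)

    def forward(self, hidden, speaker_emb, dialect):
        """Condition the hidden state, then apply the dialect's transform."""
        cond = self.fuse_condition(speaker_emb, dialect)
        conditioned = [h + c for h, c in zip(hidden, cond)]
        routed = matvec(self.experts[dialect], conditioned)
        # Residual connection keeps the speaker-conditioned signal intact.
        return [c + r for c, r in zip(conditioned, routed)]


router = ToyDialectRouter(dim=4)
speaker = [0.1, 0.2, 0.3, 0.4]   # would come from reference audio in practice
hidden = [1.0, 0.0, 0.0, 0.0]    # would come from a text encoder in practice
amdo_out = router.forward(hidden, speaker, "Amdo")
kham_out = router.forward(hidden, speaker, "Kham")
```

With the same speaker embedding and hidden state, changing only the dialect label changes the output, which is the behaviour the key facts describe: dialect-specific variation applied on top of a fixed speaker identity.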
