ARTFEED — Contemporary Art Intelligence

FMSD-TTS: Few-Shot Multi-Dialect Tibetan Speech Synthesis

other · 2026-04-27

Researchers have introduced FMSD-TTS, a few-shot, multi-speaker text-to-speech framework for Tibetan that covers the language's three major dialects: U-Tsang, Amdo, and Kham. The system synthesizes speech in a target dialect from a small amount of reference audio paired with an explicit dialect label. A speaker-dialect fusion module combines speaker and dialect characteristics, and a Dialect-Specialized Dynamic Routing Network (DSDR-Net) captures fine-grained acoustic and linguistic differences between dialects while preserving the speaker's identity. In the reported evaluations, FMSD-TTS significantly outperforms baseline models in both dialectal expressiveness and speaker similarity. The synthesized speech is further validated on a challenging speech-to-speech dialect conversion task. The paper is available on arXiv under ID 2505.14351.

Key facts

  • Tibetan is a low-resource language with minimal parallel speech corpora across its three major dialects: U-Tsang, Amdo, and Kham.
  • FMSD-TTS is a few-shot, multi-speaker, multi-dialect text-to-speech framework.
  • The framework uses limited reference audio and explicit dialect labels.
  • It features a speaker-dialect fusion module and a Dialect-Specialized Dynamic Routing Network (DSDR-Net).
  • DSDR-Net captures fine-grained acoustic and linguistic variations across dialects while preserving speaker identity.
  • FMSD-TTS significantly outperforms baselines in dialectal expressiveness and speaker similarity.
  • Synthesized speech is validated through a speech-to-speech dialect conversion task.
  • The paper is available on arXiv under ID 2505.14351.
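
To make the two components named above more concrete, here is a minimal, illustrative sketch of how a speaker-dialect fusion step and a dialect-routed layer could fit together. It is not the authors' implementation: the summary does not describe the internals of FMSD-TTS, so the class name DialectRoutedLayer, the concatenation-based fusion, the hard routing by dialect label, and the random weights are all assumptions made for illustration only.

```python
import random

# The three dialects named in the article.
DIALECTS = ("U-Tsang", "Amdo", "Kham")


def make_matrix(rows, cols, rng):
    """Build a small random weight matrix (stand-in for learned weights)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class DialectRoutedLayer:
    """Hypothetical layer: one linear 'expert' per dialect, chosen by label.

    Assumption: routing is a hard lookup on the dialect label. The actual
    DSDR-Net may route differently.
    """

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        # One embedding vector per dialect label (assumed, not from the paper).
        self.dialect_embedding = {
            d: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for d in DIALECTS
        }
        # One expert per dialect, mapping the fused vector back to `dim`.
        self.experts = {d: make_matrix(dim, 2 * dim, rng) for d in DIALECTS}

    def fuse(self, speaker_embedding, dialect):
        """Assumed fusion: concatenate speaker and dialect embeddings."""
        return list(speaker_embedding) + self.dialect_embedding[dialect]

    def forward(self, speaker_embedding, dialect):
        """Route the fused vector through the expert for the given dialect."""
        if dialect not in self.experts:
            raise ValueError(f"unknown dialect: {dialect!r}")
        fused = self.fuse(speaker_embedding, dialect)
        return matvec(self.experts[dialect], fused)


# Usage: the same speaker embedding conditioned on two different dialects.
layer = DialectRoutedLayer(dim=4)
speaker = [0.1, 0.2, 0.3, 0.4]  # placeholder for an embedding from reference audio
amdo_out = layer.forward(speaker, "Amdo")
kham_out = layer.forward(speaker, "Kham")
```

In this sketch the speaker vector stays fixed while the dialect label selects both the embedding and the expert, which mirrors the stated goal of changing dialect while preserving speaker identity. A trained system would learn these weights rather than sample them at random.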

Entities

Institutions

  • arXiv
