ARTFEED — Contemporary Art Intelligence

PilotTTS: Lightweight TTS System Trained on 200K Hours with Open-Source Tools

other · 2026-05-27

Researchers have unveiled PilotTTS, a lightweight autoregressive text-to-speech system that achieves competitive results with only 200K hours of training data, all of it processed with open-source tools. The design combines a streamlined architecture with careful data engineering, including a reproducible multi-stage pipeline for quality assessment, label annotation, and data filtering. A key contribution is Q-Former-based conditioning, which decouples speaker identity from speaking style through cross-sample paired training. Within a single framework, PilotTTS supports zero-shot voice cloning, emotion synthesis across 11 categories, and paralinguistic synthesis across 4 categories. The work aims to lower the barrier to entry for research teams with limited resources by reducing dependence on large proprietary datasets and complex architectures.
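To make the three pipeline stages concrete (quality assessment, label annotation, filtering), here is a minimal sketch in Python. It is not the PilotTTS pipeline: the `Clip` record, the `scorer` and `labeler` callables, and the threshold values are all hypothetical placeholders standing in for whatever open-source tools the authors actually used.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional


@dataclass
class Clip:
    """Hypothetical record for one audio clip moving through the pipeline."""
    clip_id: str
    duration_s: float
    quality: Optional[float] = None              # set by the quality stage
    labels: Dict[str, str] = field(default_factory=dict)  # set by annotation


def run_pipeline(
    clips: Iterable[Clip],
    scorer: Callable[[Clip], float],             # assumed: returns a score in [0, 1]
    labeler: Callable[[Clip], Dict[str, str]],   # assumed: returns label fields
    min_quality: float = 0.6,                    # illustrative threshold only
    min_s: float = 1.0,                          # illustrative duration bounds
    max_s: float = 30.0,
) -> List[Clip]:
    kept = []
    for clip in clips:
        # Stage 1: quality assessment.
        clip.quality = scorer(clip)
        # Stage 2: label annotation (e.g. emotion or paralinguistic tags).
        clip.labels.update(labeler(clip))
        # Stage 3: filtering on quality score and duration.
        if clip.quality >= min_quality and min_s <= clip.duration_s <= max_s:
            kept.append(clip)
    return kept
```

Because each stage is a plain function with fixed thresholds, rerunning the pipeline on the same input yields the same subset, which is the property a reproducible pipeline needs.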

Key facts

  • PilotTTS is a lightweight autoregressive TTS system.
  • Trained on 200K hours of data.
  • All data processing uses open-source tools.
  • Includes a reproducible multi-stage data processing pipeline.
  • Uses Q-Former-based conditioning to decouple speaker identity from speaking style.
  • Supports zero-shot voice cloning.
  • Supports emotion synthesis (11 categories).
  • Supports paralinguistic synthesis (4 categories).
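One way to picture cross-sample paired training is as a sampling rule: for each target utterance, the speaker reference and the style reference come from different clips, so neither reference alone carries both attributes. The sketch below illustrates that idea only. The pairing rule, the `Utt` and `Triple` types, and `build_triples` are assumptions for illustration, not details confirmed by the source, and the Q-Former module itself is not modeled.

```python
import random
from collections import defaultdict
from typing import List, NamedTuple


class Utt(NamedTuple):
    utt_id: str
    speaker: str
    style: str   # e.g. an emotion label


class Triple(NamedTuple):
    target: Utt
    speaker_ref: Utt   # same speaker as target, different style
    style_ref: Utt     # same style as target, different speaker


def build_triples(utts: List[Utt], seed: int = 0) -> List[Triple]:
    """Assumed pairing rule: draw each reference from a different sample."""
    rng = random.Random(seed)
    by_speaker = defaultdict(list)
    by_style = defaultdict(list)
    for u in utts:
        by_speaker[u.speaker].append(u)
        by_style[u.style].append(u)

    triples = []
    for t in utts:
        # Speaker reference must not reveal the target's style.
        spk_cands = [u for u in by_speaker[t.speaker] if u.style != t.style]
        # Style reference must not reveal the target's speaker.
        sty_cands = [u for u in by_style[t.style] if u.speaker != t.speaker]
        if spk_cands and sty_cands:
            triples.append(Triple(t, rng.choice(spk_cands), rng.choice(sty_cands)))
    return triples
```

Under this assumed rule, a conditioning module that receives `speaker_ref` can only learn identity from it, and one that receives `style_ref` can only learn style, which is the intuition behind decoupling the two.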
