PilotTTS: Lightweight TTS System Trained on 200K Hours with Open-Source Tools

other · 2026-05-27

Researchers have introduced PilotTTS, an efficient autoregressive text-to-speech system that delivers competitive results with only 200K hours of training data, all of it processed with open-source tools. The design combines a streamlined architecture with careful data engineering, including a reproducible multi-stage pipeline for quality assessment, label annotation, and filtering. A key component is Q-Former-based conditioning, which separates speaker identity from speaking style through cross-sample paired training. Within a single framework, PilotTTS supports zero-shot voice cloning, emotion synthesis across 11 categories, and paralinguistic synthesis across 4 categories. The work aims to lower the barrier to entry for research teams with limited resources by reducing dependence on large proprietary datasets and complex architectures.
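The three pipeline stages named above (quality assessment, label annotation, filtering) can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the `Clip` fields, the function names, the lookup dictionaries standing in for open-source scoring and labeling tools, and the threshold values. None of it is taken from the PilotTTS release.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Clip:
    """One audio clip with hypothetical metadata (illustrative fields only)."""
    clip_id: str
    duration_s: float
    quality: Optional[float] = None             # set by the assessment stage
    labels: dict = field(default_factory=dict)  # set by the annotation stage


def assess_quality(clip, scores):
    # Stage 1: attach a quality score. Here it is a dictionary lookup;
    # a real pipeline would call an open-source quality estimator.
    clip.quality = scores.get(clip.clip_id, 0.0)
    return clip


def annotate(clip, annotations):
    # Stage 2: attach labels such as emotion or paralinguistic tags.
    clip.labels = dict(annotations.get(clip.clip_id, {}))
    return clip


def keep(clip, min_quality=3.0, max_duration_s=30.0):
    # Stage 3: filter on thresholds. Both thresholds are made-up values.
    return (
        clip.quality is not None
        and clip.quality >= min_quality
        and clip.duration_s <= max_duration_s
    )


def run_pipeline(clips, scores, annotations):
    # Run the stages in a fixed order so the result is reproducible.
    kept = []
    for clip in clips:
        clip = annotate(assess_quality(clip, scores), annotations)
        if keep(clip):
            kept.append(clip)
    return kept


clips = [Clip("a", 4.2), Clip("b", 45.0), Clip("c", 6.0)]
scores = {"a": 4.1, "b": 4.5, "c": 2.0}
annotations = {"a": {"emotion": "happy"}, "c": {"emotion": "sad"}}

kept = run_pipeline(clips, scores, annotations)
print([c.clip_id for c in kept])  # only clip "a" passes both thresholds
```

Keeping each stage as a separate pure function over a clip record is one simple way to make such a pipeline reproducible: the same inputs and thresholds always yield the same kept set.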

Key facts

PilotTTS is a lightweight autoregressive TTS system.
Trained on 200K hours of data.
All data processing uses open-source tools.
Includes a reproducible multi-stage data processing pipeline.
Uses Q-Former-based conditioning to decouple speaker identity from speaking style.
Supports zero-shot voice cloning.
Supports emotion synthesis (11 categories).
Supports paralinguistic synthesis (4 categories).
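
The Q-Former-based conditioning listed above can be illustrated with a rough sketch of the general idea: a fixed set of query vectors cross-attends over variable-length reference features and returns a fixed number of conditioning tokens. The sketch assumes that cross-sample pairing means the speaker reference and the style reference come from different clips, and it uses random vectors in place of learned queries and real audio features. It omits the learned projections, multiple heads, and training loop of an actual Q-Former, so it should not be read as the PilotTTS implementation.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, frames):
    """Single-head cross-attention without learned projections.

    queries: n_q vectors of size d (stand-ins for learnable query tokens)
    frames:  T vectors of size d (stand-ins for reference-audio features)
    Returns n_q vectors of size d, regardless of T.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * fi for qi, fi in zip(q, f)) / math.sqrt(d) for f in frames]
        weights = softmax(scores)
        out.append([sum(w * f[j] for w, f in zip(weights, frames)) for j in range(d)])
    return out


def random_vectors(rng, n, d):
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


rng = random.Random(0)
d, n_q = 8, 4

# Separate query sets for the two factors (hypothetical names).
speaker_queries = random_vectors(rng, n_q, d)
style_queries = random_vectors(rng, n_q, d)

# Assumed cross-sample pairing: the two references are different clips,
# so they can have different lengths and different content.
speaker_ref = random_vectors(rng, 50, d)   # clip A: supplies the voice
style_ref = random_vectors(rng, 300, d)    # clip B: supplies the style

speaker_tokens = cross_attend(speaker_queries, speaker_ref)
style_tokens = cross_attend(style_queries, style_ref)

print(len(speaker_tokens), len(speaker_tokens[0]))
print(len(style_tokens), len(style_tokens[0]))
```

Both calls return `n_q` tokens of size `d` even though the references differ in length, which is what lets a downstream autoregressive model consume a fixed-size conditioning prefix.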

Sources

arXiv cs.AI — 2026-05-27