Voice Mapping Framework for TTS Quality Assessment
A recent study published on arXiv presents voice mapping as a method for assessing the quality of text-to-speech (TTS) synthesis. The researchers evaluated six TTS models (Merlin, Tacotron 2, Transformer TTS, FastSpeech 2, Glow-TTS, and VITS) using three metrics: crest factor, spectrum balance, and cepstral peak prominence (CPPs). The findings identify voice range as a key indicator of a model's capability, with VITS covering the widest range. Although Glow-TTS has a limited voice range, it excels in soft phonation because of its higher spectrum balance. CPPs values between 7 and 8 dB indicate natural voice quality, while values above 10 dB sound robotic. These results underscore the value of voice mapping for evaluating vocal effort and how TTS systems handle voice dynamics and expressiveness.
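Of the three metrics named above, crest factor has the simplest conventional definition: the ratio of a signal's peak amplitude to its RMS level, usually expressed in dB. The sketch below assumes that textbook definition and a whole-signal measurement. The function name `crest_factor_db` is hypothetical, and the study's own windowing and computation may differ.

```python
import math


def crest_factor_db(samples):
    """Crest factor in dB, assuming the textbook form 20*log10(peak / RMS).

    Hypothetical helper for illustration; not taken from the study.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        raise ValueError("crest factor is undefined for an all-zero signal")
    return 20.0 * math.log10(peak / rms)


# Synthetic test signals (assumed values): one second of a 200 Hz tone at 16 kHz.
sample_rate = 16000
sine = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(sample_rate)]
square = [1.0 if s >= 0 else -1.0 for s in sine]

sine_cf = crest_factor_db(sine)      # close to 3.01 dB (peak/RMS = sqrt(2))
square_cf = crest_factor_db(square)  # 0 dB (peak equals RMS)
```

A pure sine has a crest factor near 3 dB and a square wave 0 dB. Signals with sharper peaks relative to their average level score higher.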
Key facts
- Study investigates voice mapping as evaluation framework for TTS synthesis quality.
- Six TTS models analyzed: Merlin, Tacotron 2, Transformer TTS, FastSpeech 2, Glow-TTS, VITS.
- Metrics used: crest factor, spectrum balance, cepstral peak prominence (CPPs).
- Voice range is primary indicator of model capability; VITS has largest range.
- Glow-TTS shows superior soft phonation with higher spectrum balance despite limited voice range.
- CPPs 7-8 dB indicate natural voice quality; CPPs >10 dB sound robotic.
- Findings underscore need for voice mapping to evaluate vocal effort and dynamics.
- Published on arXiv with ID 2605.00861.
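
The CPPs ranges listed above can be written as a simple lookup. This is a minimal sketch: the function name `describe_cpps` and the "unspecified" label are assumptions, and values outside the two reported ranges are deliberately left unclassified because the summary does not characterize them.

```python
def describe_cpps(cpps_db):
    """Map a CPPs value in dB to the qualitative label reported in the summary.

    Hypothetical helper: only the 7-8 dB and >10 dB ranges come from the source.
    """
    if 7.0 <= cpps_db <= 8.0:
        return "natural"
    if cpps_db > 10.0:
        return "robotic"
    # The summary says nothing about values below 7 dB or between 8 and 10 dB.
    return "unspecified"


labels = [describe_cpps(v) for v in (6.0, 7.5, 9.0, 10.5)]
```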
Entities
Institutions
- arXiv