Voice Mapping Framework for TTS Quality Assessment

ai-technology · 2026-05-06

A recent study published on arXiv presents voice mapping as a method for assessing the quality of text-to-speech (TTS) synthesis. The researchers evaluated six TTS models (Merlin, Tacotron 2, Transformer TTS, FastSpeech 2, Glow-TTS, and VITS) using metrics such as crest factor, spectrum balance, and cepstral peak prominence (CPPs). The findings indicate that voice range is a crucial indicator of a model's capability, with VITS showing the widest range. Glow-TTS has a limited voice range but excels at soft phonation, which the study attributes to its higher spectrum balance. CPPs values between 7 and 8 dB correspond to natural voice quality, while values above 10 dB are associated with robotic-sounding speech. The results underscore the value of voice mapping for evaluating vocal effort and how well TTS systems handle voice dynamics and expressiveness.
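Of the metrics named above, crest factor has a standard textbook definition: the ratio of a signal's peak amplitude to its RMS level, usually expressed in dB. The sketch below illustrates that general definition only. The function name `crest_factor_db` and the synthetic test signals are assumptions made for illustration, and the study's own framing, windowing, and preprocessing are not described in this summary.

```python
import math


def crest_factor_db(samples):
    """Return the peak-to-RMS ratio of one frame of samples, in dB.

    Generic textbook definition; not necessarily the exact procedure
    used in the study (hypothetical helper for illustration).
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    peak = max(abs(x) for x in samples)
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    if rms == 0.0:
        raise ValueError("crest factor is undefined for an all-zero frame")
    return 20.0 * math.log10(peak / rms)


# Synthetic single-period test signals (not data from the study).
n = 1000
sine = [math.sin(2 * math.pi * k / n) for k in range(n)]
square = [1.0 if k < n // 2 else -1.0 for k in range(n)]

print(round(crest_factor_db(sine), 2))    # about 3.01 dB for a pure sine
print(round(crest_factor_db(square), 2))  # 0.0 dB for a square wave
```

A pure sine has a peak that is a factor of the square root of 2 above its RMS level, which gives roughly 3 dB. A square wave has equal peak and RMS, which gives 0 dB. Voiced speech frames typically fall well above both.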

Key facts

Study investigates voice mapping as evaluation framework for TTS synthesis quality.
Six TTS models analyzed: Merlin, Tacotron 2, Transformer TTS, FastSpeech 2, Glow-TTS, VITS.
Metrics used: crest factor, spectrum balance, cepstral peak prominence (CPPs).
Voice range is primary indicator of model capability; VITS has largest range.
Glow-TTS shows superior soft phonation with higher spectrum balance despite limited voice range.
CPPs of 7 to 8 dB indicate natural voice quality; CPPs above 10 dB sound robotic.
Findings underscore need for voice mapping to evaluate vocal effort and dynamics.
Published on arXiv with ID 2605.00861.
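The CPPs thresholds reported above can be written as a small lookup. This is a minimal sketch, assuming the two ranges are applied exactly as stated. The function name `describe_cpps` is hypothetical, and the summary does not say how values below 7 dB or between 8 and 10 dB should be read, so those are left unlabeled rather than guessed.

```python
def describe_cpps(cpps_db):
    """Map a CPPs value in dB to the qualitative label given in the summary.

    Hypothetical helper: only the two ranges stated in the summary are
    labeled; everything else is reported as not characterized.
    """
    if 7.0 <= cpps_db <= 8.0:
        return "natural"
    if cpps_db > 10.0:
        return "robotic"
    return "not characterized"


# Illustrative values only, not measurements from the study.
for value in (7.5, 9.0, 10.5):
    print(value, describe_cpps(value))
```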
