ARTFEED — Contemporary Art Intelligence

Voice Mapping Framework for TTS Quality Assessment

ai-technology · 2026-05-06

A recent study published on arXiv presents voice mapping as a method for assessing the quality of text-to-speech (TTS) synthesis. The researchers evaluated six TTS models (Merlin, Tacotron 2, Transformer TTS, FastSpeech 2, Glow-TTS, and VITS) using metrics such as crest factor, spectrum balance, and cepstral peak prominence (CPPs). The findings indicate that voice range is a crucial indicator of a model's capability, with VITS showing the widest range. Although Glow-TTS has a limited voice range, it excels in soft phonation because of its higher spectrum balance. CPPs values of 7 to 8 dB indicate natural voice quality, while values above 10 dB sound robotic. The results underscore the value of voice mapping for evaluating vocal effort and how well TTS systems handle voice dynamics and expressiveness.
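Of the metrics named above, crest factor is the simplest to illustrate: it is the ratio of a signal's peak amplitude to its RMS amplitude, usually expressed in dB. The following is a minimal sketch of that standard definition, not the study's own analysis pipeline; the function name `crest_factor_db` and the test signals are hypothetical and chosen only for illustration.

```python
import math


def crest_factor_db(samples):
    """Crest factor in dB: 20 * log10(peak / RMS) of a list of samples.

    Hypothetical helper for illustration; the study's own tooling is not
    described in this summary.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    peak = max(abs(x) for x in samples)
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    if rms == 0.0:
        raise ValueError("crest factor is undefined for an all-zero signal")
    return 20.0 * math.log10(peak / rms)


# One full period of a sine wave and of a square wave, 1000 samples each.
sine = [math.sin(2 * math.pi * k / 1000) for k in range(1000)]
square = [1.0 if k < 500 else -1.0 for k in range(1000)]

print(f"sine:   {crest_factor_db(sine):.2f} dB")
print(f"square: {crest_factor_db(square):.2f} dB")
```

A pure sine has a crest factor of about 3 dB and a square wave has 0 dB, so the measure rises as a waveform becomes more peaked relative to its average level.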

Key facts

  • Study investigates voice mapping as evaluation framework for TTS synthesis quality.
  • Six TTS models analyzed: Merlin, Tacotron 2, Transformer TTS, FastSpeech 2, Glow-TTS, VITS.
  • Metrics used: crest factor, spectrum balance, cepstral peak prominence (CPPs).
  • Voice range is primary indicator of model capability; VITS has largest range.
  • Glow-TTS shows superior soft phonation with higher spectrum balance despite limited voice range.
  • CPPs 7-8 dB indicate natural voice quality; CPPs >10 dB sound robotic.
  • Findings underscore need for voice mapping to evaluate vocal effort and dynamics.
  • Published on arXiv with ID 2605.00861.
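
The CPPs ranges listed above can be read as a simple banding rule. The sketch below encodes only the two ranges this summary reports; the function name `label_cpps`, the inclusive band edges, and the "unspecified" label for values the summary does not cover (below 7 dB, or above 8 dB up to 10 dB) are assumptions made for illustration, not part of the study.

```python
def label_cpps(cpps_db):
    """Map a CPPs value in dB to the qualitative label reported in the summary.

    Assumption: band edges are inclusive. Values outside the two reported
    ranges are labeled "unspecified" rather than guessed.
    """
    if 7.0 <= cpps_db <= 8.0:
        return "natural"
    if cpps_db > 10.0:
        return "robotic"
    return "unspecified"


for value in (7.5, 9.0, 10.5):
    print(value, label_cpps(value))
```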

Entities

Institutions

  • arXiv
