Fine-Grained Speaking Style Control in Prompt-Based TTS Models

other · 2026-05-28

Researchers propose methods for fine-grained control of speaking style in prompt-based text-to-speech (TTS) models. Existing models typically apply a single global style to each utterance, which limits applications that need gradual style shifts across multiple utterances or within one. For inter-utterance style interpolation, the authors compute direction vectors between contrastive style prompts in the embedding space. For intra-utterance transitions, they identify a significant attention bias toward initial tokens in autoregressive TTS decoders and introduce KV-cache manipulation to mitigate it, which allows the style to change within a single utterance.
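The direction-vector idea can be illustrated with a small sketch. It is not the paper's implementation: the 3-dimensional embeddings, the function names `direction_vector` and `shift_style`, the unit-length normalization, and the step sizes are all assumptions made for illustration. A real system would take the embeddings from the TTS model's style-prompt encoder.

```python
import math


def direction_vector(emb_a, emb_b):
    """Unit vector pointing from style embedding A toward style embedding B.

    Normalizing to unit length is an assumption of this sketch, not a detail
    taken from the paper.
    """
    diff = [b - a for a, b in zip(emb_a, emb_b)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("contrastive prompts map to the same embedding")
    return [d / norm for d in diff]


def shift_style(base, direction, alpha):
    """Move a style embedding by `alpha` along `direction`."""
    return [x + alpha * d for x, d in zip(base, direction)]


# Toy embeddings standing in for the encoded prompts (hypothetical values),
# e.g. "speak calmly" versus "speak excitedly".
calm = [0.0, 1.0, 0.0]
excited = [3.0, 5.0, 0.0]

direction = direction_vector(calm, excited)

# One embedding per utterance, moving gradually from calm toward excited.
steps = [shift_style(calm, direction, alpha) for alpha in (0.0, 2.5, 5.0)]
```

Each entry of `steps` would condition one utterance, so the style moves in small increments instead of jumping between two prompts.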

Key facts

arXiv:2605.27376v1
Announce Type: cross
Prompt-based TTS models enable natural-language-driven control of speaking style
Existing models offer limited fine-grained control and apply a single global style per utterance
Proposes techniques for inter-utterance style interpolation and intra-utterance style transition
Inter-utterance: direction vectors between contrastive style prompts in embedding space
Intra-utterance: identifies attention bias toward early tokens in autoregressive TTS decoders
Introduces KV-cache manipulation to mitigate attention bias
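The summary does not say how the KV cache is manipulated, so the following is only a toy illustration of why the cache matters. It assumes a single attention head, a cache stored as plain lists, and one hypothetical form of manipulation: overwriting the cached entries at the earliest positions with entries from a second style prompt. The names `attend` and `swap_prefix` and all numbers are invented for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention over cached keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


def swap_prefix(cache, new_keys, new_values):
    """Return a new cache whose first len(new_keys) entries are replaced.

    Hypothetical manipulation for illustration; the original cache is not
    modified.
    """
    n = len(new_keys)
    return {
        "keys": list(new_keys) + cache["keys"][n:],
        "values": list(new_values) + cache["values"][n:],
    }


# Positions 0-1 hold the style prompt, positions 2-3 hold later tokens.
# The keys are chosen so the query attends mostly to the early positions,
# mimicking the bias described above.
cache = {
    "keys": [[2.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 1.0]],
    "values": [[1.0, 0.0], [1.0, 0.0], [0.0, 0.0], [0.0, 0.0]],  # style A
}
query = [2.0, 0.0]

out_before, weights = attend(query, cache["keys"], cache["values"])
prefix_mass = weights[0] + weights[1]  # attention mass on the early tokens

# Overwrite the early entries with values from a second style prompt (style B).
swapped = swap_prefix(cache, [[2.0, 0.0], [2.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]])
out_after, _ = attend(query, swapped["keys"], swapped["values"])
```

In this toy setup most of the attention mass sits on the first two positions, so the output follows whatever style those cached entries encode. That is why editing the cache, and not only the later tokens, can change the style partway through decoding.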

Sources

arXiv cs.AI — 2026-05-28