Fine-Grained Speaking Style Control in Prompt-Based TTS Models
This work introduces techniques for fine-grained control of speaking style in prompt-based text-to-speech (TTS) models. Existing models typically apply a single global style to an entire utterance, which limits applications that need gradual style shifts across multiple utterances or within a single one. For interpolation between utterances, the authors compute direction vectors between contrastive style prompts in the embedding space. For transitions within an utterance, they identify a strong attention bias toward early tokens in autoregressive TTS decoders and propose KV-cache manipulation to mitigate it, enabling flexible style changes within a single utterance.
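The direction-vector idea for inter-utterance interpolation can be illustrated with a minimal sketch. The summary does not give the exact formula, so the code assumes the simplest reading: the direction is the difference between two contrastive prompt embeddings, and each utterance uses an embedding shifted a fraction of the way along it. The function names, the toy 4-dimensional vectors, and the "calm"/"excited" labels are all hypothetical; a real system would obtain the embeddings from the model's prompt encoder.

```python
from typing import List, Sequence

Vector = List[float]


def style_direction(source: Sequence[float], target: Sequence[float]) -> Vector:
    """Direction vector pointing from the source style embedding to the target."""
    if len(source) != len(target):
        raise ValueError("embeddings must have the same dimensionality")
    return [t - s for s, t in zip(source, target)]


def shift_along(base: Sequence[float], direction: Sequence[float], alpha: float) -> Vector:
    """Move `base` along `direction`: alpha=0 keeps the base, alpha=1 applies the full shift."""
    return [b + alpha * d for b, d in zip(base, direction)]


# Toy embeddings standing in for the output of a prompt encoder (assumed values).
calm = [0.0, 1.0, -1.0, 2.0]
excited = [1.0, 3.0, 1.0, 0.0]

direction = style_direction(calm, excited)

# One style embedding per utterance, moving gradually from calm to excited.
schedule = [shift_along(calm, direction, step / 4) for step in range(5)]
```

Each entry in `schedule` would condition one utterance, so the style changes in even steps rather than jumping between two discrete prompts.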
Key facts
- arXiv:2605.27376v1
- Announce Type: cross
- Prompt-based TTS models enable natural language-driven speaking style control
- Existing models offer limited fine-grained control and apply a single global style per utterance
- Proposes techniques for inter-utterance style interpolation and intra-utterance style transition
- Inter-utterance: direction vectors between contrastive style prompts in embedding space
- Intra-utterance: identifies attention bias toward early tokens in autoregressive TTS decoders
- Introduces KV-cache manipulation to mitigate attention bias
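The intra-utterance points above can be made concrete with a toy example. The summary does not specify which cache operation the authors use, so this sketch assumes one possible form: midway through decoding, the cached keys for the style-prompt positions are replaced with keys computed from a new style prompt, so that the early positions the decoder favors now carry the new style. The single-head attention, the helper names, and all numeric values are hypothetical and stand in for a real decoder; values would be swapped in the same way as keys.

```python
import math
from typing import List, Sequence

Vector = List[float]


def attention_weights(query: Sequence[float], keys: Sequence[Sequence[float]]) -> Vector:
    """Softmax of scaled dot products between one query and all cached keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def replace_prefix(cache: Sequence[Sequence[float]],
                   new_prefix: Sequence[Sequence[float]]) -> List[Vector]:
    """Return a new cache whose first len(new_prefix) entries are swapped out."""
    if len(new_prefix) > len(cache):
        raise ValueError("new prefix is longer than the cache")
    kept = cache[len(new_prefix):]
    return [list(v) for v in new_prefix] + [list(v) for v in kept]


# Cached keys: two style-prompt positions followed by two generated positions.
style_a_keys = [[2.0, 0.0], [1.5, 0.5]]
style_b_keys = [[0.0, 2.0], [0.5, 1.5]]
generated_keys = [[0.2, 0.1], [0.1, 0.3]]
keys = style_a_keys + generated_keys

# Swap the prompt positions to the new style; generated positions are kept.
swapped_keys = replace_prefix(keys, style_b_keys)

query = [1.0, 1.0]
weights_before = attention_weights(query, keys)
weights_after = attention_weights(query, swapped_keys)

# Share of attention that falls on the prompt positions after the swap.
prefix_mass_after = sum(weights_after[:len(style_b_keys)])
```

Because `replace_prefix` returns a copy, the original cache stays available, which makes it straightforward to compare decoding with and without the swap.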