ARTFEED — Contemporary Art Intelligence

Fine-Grained Speaking Style Control in Prompt-Based TTS Models

other · 2026-05-28

Researchers have introduced methods for fine-grained control of speaking style in prompt-based text-to-speech (TTS) models. The study addresses a shortcoming of existing models, which typically apply a single global style to an entire utterance; this limits applications that need gradual style shifts across multiple utterances or within one. For style interpolation between utterances, the team computes direction vectors between contrastive style prompts in the embedding space. They also identify a strong attention bias toward early tokens in autoregressive TTS decoders and propose KV-cache manipulation to counteract it, enabling flexible style changes within a single utterance.
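
The inter-utterance idea can be illustrated with a small sketch. It assumes that style prompts have already been mapped to fixed-length embedding vectors by some text encoder; the function names (`mean_embedding`, `style_direction`, `shift_style`), the toy 3-dimensional embeddings, and the scaling factor `alpha` are all hypothetical and are not taken from the paper. The direction is the difference between the mean embeddings of two contrastive prompt sets, and moving a base embedding along it by increasing amounts gives a gradual style shift across utterances.

```python
from typing import List, Sequence

Vector = List[float]


def mean_embedding(embeddings: Sequence[Sequence[float]]) -> Vector:
    """Average a set of prompt embeddings element-wise."""
    count = len(embeddings)
    return [sum(column) / count for column in zip(*embeddings)]


def style_direction(positive: Sequence[Sequence[float]],
                    negative: Sequence[Sequence[float]]) -> Vector:
    """Direction from the 'negative' style toward the 'positive' style."""
    pos_mean = mean_embedding(positive)
    neg_mean = mean_embedding(negative)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def shift_style(base: Sequence[float], direction: Sequence[float],
                alpha: float) -> Vector:
    """Move a base style embedding along the direction by a factor alpha."""
    return [b + alpha * d for b, d in zip(base, direction)]


# Toy embeddings standing in for encoded contrastive prompts
# (for example "speaks loudly" versus "speaks quietly"); values are made up.
loud_prompts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
quiet_prompts = [[0.0, 0.0, 1.0], [0.0, 2.0, 1.0]]

direction = style_direction(loud_prompts, quiet_prompts)

# One embedding per utterance, with the style strength increasing each time.
neutral = [0.5, 0.5, 0.5]
per_utterance = [shift_style(neutral, direction, a) for a in (0.0, 0.5, 1.0)]
```

Each entry of `per_utterance` would then condition the synthesis of one utterance; how the shifted embedding is fed back into a particular TTS model is outside this sketch.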

Key facts

  • arXiv:2605.27376v1
  • Announce type: cross (cross-listed on arXiv)
  • Prompt-based TTS models enable natural-language-driven speaking style control
  • Existing models offer limited fine-grained control and apply a single global style per utterance
  • Proposes techniques for inter-utterance style interpolation and intra-utterance style transition
  • Inter-utterance: direction vectors between contrastive style prompts in embedding space
  • Intra-utterance: identifies attention bias toward early tokens in autoregressive TTS decoders
  • Introduces KV-cache manipulation to mitigate attention bias
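
The summary does not say how the KV cache is manipulated, so the following is only one plausible reading, not the paper's confirmed procedure. It assumes the style prompt occupies the first positions of the decoder's key/value cache and that, because attention leans heavily on those early positions, replacing their cached entries mid-generation changes the style the later tokens are conditioned on. The single-head toy attention, the helper `swap_style_prefix`, and all numeric values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
KVCache = List[Tuple[Vector, Vector]]  # one (key, value) pair per cached position


def attend(query: Sequence[float], cache: KVCache) -> Vector:
    """Single-head scaled dot-product attention over a KV cache."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale
              for key, _ in cache]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(score - peak) for score in scores]
    total = sum(weights)
    dim = len(cache[0][1])
    return [sum(w * value[i] for w, (_, value) in zip(weights, cache)) / total
            for i in range(dim)]


def swap_style_prefix(cache: KVCache, new_prefix: KVCache,
                      prefix_len: int) -> KVCache:
    """Return a cache whose first prefix_len entries come from new_prefix.

    Entries for already generated tokens (after the prefix) are kept as-is.
    """
    if len(new_prefix) != prefix_len:
        raise ValueError("new_prefix must have exactly prefix_len entries")
    return list(new_prefix) + list(cache[prefix_len:])


# Assumed layout: two style-prompt positions followed by one generated token.
calm_prefix = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [1.0, 0.0])]
excited_prefix = [([1.0, 0.0], [0.0, 1.0]), ([1.0, 0.0], [0.0, 1.0])]
generated = [([0.0, 1.0], [0.5, 0.5])]

cache = calm_prefix + generated
swapped = swap_style_prefix(cache, excited_prefix, prefix_len=2)

query = [1.0, 0.0]  # a decoding step that attends mostly to the prefix
before = attend(query, cache)
after = attend(query, swapped)
```

In this toy setup the attention output is dominated by the first value dimension before the swap and by the second after it, while the cached entry for the generated token is untouched. A real decoder has many layers and heads, each with its own cache, and the new prefix entries would have to be computed by running the new style prompt through the model.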
