UniSonate: Unified AI Model Generates Speech, Music, and Sound Effects from Text
Researchers have introduced UniSonate, a framework that unifies the generation of speech, music, and sound effects behind a single, reference-free natural language interface. The model targets the fragmentation of generative audio, which is typically split into text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA) tasks, each with its own control scheme. UniSonate bridges highly structured signals such as speech and music with less structured sound effects, pairing a dynamic token injection mechanism, which enables precise duration control inside a phoneme-driven Multimodal Diffusion Transformer (MM-DiT), with a multi-stage curriculum learning strategy. The study was published on arXiv under ID 2604.22209.
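The summary does not detail how the injection works, but the general idea of token-level duration control can be sketched: per-phoneme durations are discretized into duration tokens and interleaved with the phoneme embeddings before they condition the transformer. Everything below (the class name `DurationTokenInjector`, the bucketing scheme, the dimensions) is an illustrative assumption, not the authors' implementation.

```python
# Hypothetical sketch of dynamic token injection for duration control.
# Assumption: durations are bucketed into discrete tokens and interleaved
# with phoneme embeddings before conditioning the MM-DiT backbone.
import torch
import torch.nn as nn

class DurationTokenInjector(nn.Module):
    def __init__(self, num_phonemes=100, num_buckets=64, dim=256, max_dur=2.0):
        super().__init__()
        self.phoneme_emb = nn.Embedding(num_phonemes, dim)
        self.duration_emb = nn.Embedding(num_buckets, dim)
        self.num_buckets = num_buckets
        self.max_dur = max_dur  # seconds mapped to the last bucket

    def forward(self, phonemes, durations):
        # phonemes: (B, T) int64 ids; durations: (B, T) float seconds
        buckets = (durations / self.max_dur * (self.num_buckets - 1)).long()
        buckets = buckets.clamp(0, self.num_buckets - 1)
        p = self.phoneme_emb(phonemes)   # (B, T, D)
        d = self.duration_emb(buckets)   # (B, T, D)
        # Interleave [p1, d1, p2, d2, ...] -> (B, 2T, D) conditioning sequence
        return torch.stack((p, d), dim=2).flatten(1, 2)
```

Under this reading, duration control at inference reduces to editing a phoneme's duration value, since the conditioning sequence is rebuilt per utterance, which is one plausible sense of "dynamic" injection.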
Key facts
- UniSonate is a unified flow-matching framework for audio generation (see the flow-matching sketch after this list).
- It synthesizes speech, music, and sound effects from text instructions.
- The model uses a dynamic token injection mechanism for duration control.
- It employs a phoneme-driven Multimodal Diffusion Transformer (MM-DiT).
- The framework uses a multi-stage curriculum learning strategy (sketched after this list).
- It unifies TTS, TTM, and TTA tasks.
- The paper is available on arXiv (ID 2604.22209).
- The approach is reference-free and uses natural language instructions.
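"Flow matching" in the first bullet refers to a standard generative recipe: train a velocity field that transports noise samples to data along simple paths. The step below is a minimal sketch of the common conditional flow-matching objective with linear interpolation paths; `model`, its call signature, and the latent shapes are placeholders, not UniSonate's actual architecture or loss.

```python
# One conditional flow-matching training step (standard formulation,
# not necessarily the paper's exact objective).
import torch
import torch.nn.functional as F

def flow_matching_step(model, x1, cond, optimizer):
    """x1: (B, T, D) target audio latents; cond: text/phoneme conditioning."""
    x0 = torch.randn_like(x1)                      # noise sample
    t = torch.rand(x1.shape[0], device=x1.device)  # one timestep per item
    t_ = t.view(-1, 1, 1)
    xt = (1 - t_) * x0 + t_ * x1                   # linear path: noise -> data
    v_target = x1 - x0                             # constant velocity of path
    v_pred = model(xt, t, cond)                    # predicted velocity field
    loss = F.mse_loss(v_pred, v_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```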
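The multi-stage curriculum noted above presumably stages the task mix over training, starting from more structured audio and progressively adding harder tasks. The schedule below is a guess at that pattern; the stage boundaries and task ordering are illustrative assumptions, not taken from the paper.

```python
# Hypothetical multi-stage curriculum: the mix of tasks sampled per batch
# changes with training progress. Stage boundaries are illustrative.
import random

STAGES = [
    (0.3, ["tts"]),                # stage 1: speech only
    (0.6, ["tts", "ttm"]),         # stage 2: add music
    (1.0, ["tts", "ttm", "tta"]),  # stage 3: all tasks, incl. sound effects
]

def sample_task(progress: float) -> str:
    """progress in [0, 1]: fraction of total training steps completed."""
    for boundary, tasks in STAGES:
        if progress <= boundary:
            return random.choice(tasks)
    return random.choice(STAGES[-1][1])
```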