UniSonate: Unified AI Model Generates Speech, Music, and Sound Effects from Text
Researchers have introduced UniSonate, a framework that unifies the generation of speech, music, and sound effects behind a single, reference-free natural language interface. The model targets the fragmentation of generative audio, which is typically split into text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA) tasks, each with its own control scheme. UniSonate bridges highly structured signals such as speech and music with less structured sound effects, pairing a dynamic token injection mechanism, which enables precise duration control inside a phoneme-driven Multimodal Diffusion Transformer (MM-DiT), with a multi-stage curriculum learning strategy. The study was published on arXiv under ID 2604.22209.
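The summary does not detail how the injection works, but the general idea of token-level duration control can be sketched: per-phoneme durations are discretized into duration tokens and interleaved with the phoneme embeddings before they condition the transformer. Everything below (the class name `DurationTokenInjector`, the bucketing scheme, the dimensions) is an illustrative assumption, not the authors' implementation.

```python
# Hypothetical sketch of dynamic token injection for duration control.
# Assumption: durations are bucketed into discrete tokens and interleaved
# with phoneme embeddings before conditioning the MM-DiT backbone.
import torch
import torch.nn as nn

class DurationTokenInjector(nn.Module):
    def __init__(self, num_phonemes=100, num_buckets=64, dim=256, max_dur=2.0):
        super().__init__()
        self.phoneme_emb = nn.Embedding(num_phonemes, dim)
        self.duration_emb = nn.Embedding(num_buckets, dim)
        self.num_buckets = num_buckets
        self.max_dur = max_dur  # seconds mapped to the last bucket

    def forward(self, phonemes, durations):
        # phonemes: (B, T) int64 ids; durations: (B, T) float seconds
        buckets = (durations / self.max_dur * (self.num_buckets - 1)).long()
        buckets = buckets.clamp(0, self.num_buckets - 1)
        p = self.phoneme_emb(phonemes)   # (B, T, D)
        d = self.duration_emb(buckets)   # (B, T, D)
        # Interleave [p1, d1, p2, d2, ...] -> (B, 2T, D) conditioning sequence
        return torch.stack((p, d), dim=2).flatten(1, 2)
```

Under this reading, duration control at inference reduces to editing a phoneme's duration value, since the conditioning sequence is rebuilt per utterance, which is one plausible sense of "dynamic" injection.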
Key facts
- UniSonate is a unified flow-matching framework for audio generation (see the flow-matching sketch after this list).
- It synthesizes speech, music, and sound effects from text instructions.
- The model uses a dynamic token injection mechanism for duration control.
- It employs a phoneme-driven Multimodal Diffusion Transformer (MM-DiT).
- The framework uses a multi-stage curriculum learning strategy (sketched after this list).
- It unifies TTS, TTM, and TTA tasks.
- The paper is available on arXiv (ID 2604.22209).
- The approach is reference-free and uses natural language instructions.
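"Flow matching" in the first bullet refers to a standard generative recipe: train a velocity field that transports noise samples to data along simple paths. The step below is a minimal sketch of the common conditional flow-matching objective with linear interpolation paths; `model`, its call signature, and the latent shapes are placeholders, not UniSonate's actual architecture or loss.

```python
# One conditional flow-matching training step (standard formulation,
# not necessarily the paper's exact objective).
import torch
import torch.nn.functional as F

def flow_matching_step(model, x1, cond, optimizer):
    """x1: (B, T, D) target audio latents; cond: text/phoneme conditioning."""
    x0 = torch.randn_like(x1)                      # noise sample
    t = torch.rand(x1.shape[0], device=x1.device)  # one timestep per item
    t_ = t.view(-1, 1, 1)
    xt = (1 - t_) * x0 + t_ * x1                   # linear path: noise -> data
    v_target = x1 - x0                             # constant velocity of path
    v_pred = model(xt, t, cond)                    # predicted velocity field
    loss = F.mse_loss(v_pred, v_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```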
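The multi-stage curriculum noted above presumably stages the task mix over training, starting from more structured audio and progressively adding harder tasks. The schedule below is a guess at that pattern; the stage boundaries and task ordering are illustrative assumptions, not taken from the paper.

```python
# Hypothetical multi-stage curriculum: the mix of tasks sampled per batch
# changes with training progress. Stage boundaries are illustrative.
import random

STAGES = [
    (0.3, ["tts"]),                # stage 1: speech only
    (0.6, ["tts", "ttm"]),         # stage 2: add music
    (1.0, ["tts", "ttm", "tta"]),  # stage 3: all tasks, incl. sound effects
]

def sample_task(progress: float) -> str:
    """progress in [0, 1]: fraction of total training steps completed."""
    for boundary, tasks in STAGES:
        if progress <= boundary:
            return random.choice(tasks)
    return random.choice(STAGES[-1][1])
```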