ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer

other · 2026-06-01

A team of researchers has introduced ImmersiveTTS, an environment-aware text-to-speech model that generates speech blended with environmental sound. The model uses a multimodal diffusion transformer whose joint attention fuses transcript-aligned speech latents with text-informed environmental context. To improve semantic coherence, it adds a domain-specific representation alignment objective that draws on complementary self-supervised representations from speech and audio encoders. The reported experiments indicate that the model generates natural speech across a range of environmental settings while addressing differences in acoustic patterns and temporal dynamics.
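As a rough illustration of what joint attention over two token streams can look like, the sketch below concatenates speech and environment tokens and runs a single softmax attention over the combined sequence, with separate projection weights per modality. This is a generic pattern from multimodal diffusion transformers, not the ImmersiveTTS implementation; the function name `joint_attention`, the single-head layout, the weight dictionaries, and all shapes are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(speech, env, w_speech, w_env):
    """Single-head joint attention (hypothetical sketch).

    speech: (Ts, d) speech latent tokens
    env:    (Te, d) environmental context tokens
    w_*:    dicts with 'q', 'k', 'v' matrices of shape (d, d),
            one set per modality (an assumption of this sketch)
    """
    # Project each stream with its own weights, then concatenate along time.
    q = np.concatenate([speech @ w_speech["q"], env @ w_env["q"]], axis=0)
    k = np.concatenate([speech @ w_speech["k"], env @ w_env["k"]], axis=0)
    v = np.concatenate([speech @ w_speech["v"], env @ w_env["v"]], axis=0)

    # Every token attends to every token of both streams.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    out = attn @ v

    # Split the fused sequence back into its two streams.
    ts = speech.shape[0]
    return out[:ts], out[ts:], attn

rng = np.random.default_rng(0)
d, ts, te = 8, 5, 3
speech = rng.standard_normal((ts, d))
env = rng.standard_normal((te, d))
w_speech = {name: 0.1 * rng.standard_normal((d, d)) for name in "qkv"}
w_env = {name: 0.1 * rng.standard_normal((d, d)) for name in "qkv"}

speech_out, env_out, attn = joint_attention(speech, env, w_speech, w_env)
```

Because the attention matrix spans both streams, each speech token can draw on environmental context and vice versa, which is the sense in which the two modalities are fused.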

Key facts

ImmersiveTTS is an environment-aware text-to-speech model.
It generates speech integrated with environmental audio.
The model uses a multimodal diffusion transformer.
Joint attention fuses speech latents with environmental context.
A domain-specific representation alignment objective is introduced.
It leverages self-supervised representations from speech and audio encoders.
The model addresses acoustic pattern disparities and temporal dynamics.
Experimental results show natural speech generation in environmental contexts.
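
One way to picture a representation alignment objective of the kind listed above is a cosine-similarity loss that pulls projected model hidden states toward features from frozen self-supervised encoders. The sketch below assumes, purely for illustration, that speech tokens are aligned to a speech encoder and environment tokens to an audio encoder; the summary does not specify this pairing, and the function names, projection matrices, and shapes are all hypothetical.

```python
import numpy as np

def cosine_alignment_loss(hidden, target, proj, eps=1e-8):
    """Negative mean cosine similarity between projected hidden states
    and target encoder features (hypothetical sketch).

    hidden: (T, d_model) hidden states from the generative model
    target: (T, d_enc)   features from a frozen self-supervised encoder
    proj:   (d_model, d_enc) projection into the encoder feature space
    """
    pred = hidden @ proj
    pred = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    tgt = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    return float(-(pred * tgt).sum(axis=-1).mean())

def domain_alignment_loss(h_speech, h_env, f_speech, f_audio, p_speech, p_audio):
    # Assumed pairing: speech tokens <-> speech encoder features,
    # environment tokens <-> audio encoder features.
    return (cosine_alignment_loss(h_speech, f_speech, p_speech)
            + cosine_alignment_loss(h_env, f_audio, p_audio))

rng = np.random.default_rng(0)
h_speech = rng.standard_normal((5, 8))
h_env = rng.standard_normal((3, 8))
identity = np.eye(8)

# Sanity check: when hidden states already match the targets,
# each term reaches its minimum of about -1.
perfect = domain_alignment_loss(h_speech, h_env, h_speech, h_env, identity, identity)

# With unrelated random targets the loss is higher (less aligned).
f_speech = rng.standard_normal((5, 8))
f_audio = rng.standard_normal((3, 8))
mismatched = domain_alignment_loss(h_speech, h_env, f_speech, f_audio, identity, identity)
```

In training, a term like this would typically be added to the main diffusion loss with a weighting factor; that weighting is not given in the summary.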
