PlanAudio: LLM-Based Unified Speech and Sound Synthesis

ai-technology · 2026-05-28

Researchers introduce PlanAudio, a unified autoregressive LLM-based framework that generates composite audio combining speech and sound from free-form text prompts. The task, Free-Form-Text-Prompt-to-Unified-Audio generation, targets the limitations of current methods, which rely on disjoint pipelines or structured inputs. PlanAudio relies on the LLM's intrinsic reasoning rather than a traditional text encoder, which simplifies the architecture, and it uses a semantic latent chain-of-thought mechanism for implicit planning. The approach aims to capture fine-grained interactions between speech and sound, so that natural composites can be produced from unconstrained natural language. The paper is available on arXiv under ID 2605.28063.

Key facts

PlanAudio is an LLM-based framework for unified audio generation.
It synthesizes speech and sound composites from free-form text prompts.
The task is called Free-Form-Text-Prompt-to-Unified-Audio generation.
PlanAudio uses intrinsic LLM reasoning instead of traditional text encoders.
It introduces a semantic latent chain-of-thought mechanism.
The approach simplifies the model architecture and aims to capture fine-grained interactions between speech and sound.
The paper is available on arXiv (ID 2605.28063).
Current methods rely on disjoint pipelines or structured inputs.
