PlanAudio: LLM-Based Unified Speech and Sound Synthesis
Researchers introduce PlanAudio, a unified autoregressive LLM-based framework that generates composite audio containing both speech and sound from free-form text prompts. The task, called Free-Form-Text-Prompt-to-Unified-Audio generation, addresses the limitations of current methods, which rely on disjoint pipelines or structured inputs. PlanAudio relies on the LLM's intrinsic reasoning rather than a traditional text encoder, which simplifies the architecture, and it uses a semantic latent chain-of-thought mechanism for implicit planning. The approach aims to capture fine-grained interactions between speech and sound so that natural composites can be generated from unconstrained natural language. The paper is available on arXiv under ID 2605.28063.
Key facts
- PlanAudio is an LLM-based framework for unified audio generation.
- It synthesizes speech and sound composites from free-form text prompts.
- The task is called Free-Form-Text-Prompt-to-Unified-Audio generation.
- PlanAudio uses intrinsic LLM reasoning instead of traditional text encoders.
- It introduces a semantic latent chain-of-thought mechanism.
- The approach simplifies the model architecture and captures fine-grained interactions between speech and sound.
- The paper is available on arXiv (ID 2605.28063).
- Current methods rely on disjoint pipelines or structured inputs.
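To make the "plan first, then generate" idea concrete, here is a toy sketch of a single autoregressive decoding loop that emits a planning segment before the audio tokens. It is not PlanAudio's implementation: the summary gives no architectural details, so the marker tokens, the `generate` function, the `Generation` container, and the scripted `toy_model` are all hypothetical. The plan is shown as readable placeholder tokens purely for illustration; a latent chain-of-thought would presumably operate on hidden representations rather than on visible tokens.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical segment markers (assumption): the summary does not describe
# PlanAudio's actual vocabulary or sequence layout.
PLAN_START, PLAN_END, AUDIO_END = "<plan>", "</plan>", "</audio>"


@dataclass
class Generation:
    plan: List[str]   # placeholder stand-in for the implicit plan
    audio: List[str]  # placeholder stand-in for audio codec tokens


def generate(prompt: str,
             next_token: Callable[[List[str]], str],
             max_steps: int = 64) -> Generation:
    """Decode one sequence: free-form prompt -> plan segment -> audio tokens."""
    context = prompt.split() + [PLAN_START]
    plan: List[str] = []
    audio: List[str] = []
    in_plan = True
    for _ in range(max_steps):
        tok = next_token(context)   # one autoregressive step
        context.append(tok)
        if tok == PLAN_END:
            in_plan = False         # switch from planning to audio output
        elif tok == AUDIO_END:
            break
        elif in_plan:
            plan.append(tok)
        else:
            audio.append(tok)
    return Generation(plan, audio)


def toy_model(context: List[str]) -> str:
    """Scripted stand-in for an LLM; returns a fixed continuation."""
    script = ["speech:greeting", "sound:door", PLAN_END,
              "a1", "a2", "a3", AUDIO_END]
    # Number of tokens already generated after the plan-start marker.
    idx = len(context) - 1 - context.index(PLAN_START)
    return script[idx]


result = generate("a man says hello as a door creaks open", toy_model)
print(result.plan)
print(result.audio)
```

The point of the sketch is only the control flow: one model and one sequence cover both planning and audio generation, in contrast to a disjoint pipeline that would route speech and sound through separate systems.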
Entities
Institutions
- arXiv