ARTFEED — Contemporary Art Intelligence

PlanAudio: LLM-Based Unified Speech and Sound Synthesis

ai-technology · 2026-05-28

Researchers introduce PlanAudio, a unified autoregressive LLM-based framework that generates composite audio, combining speech and sound, from free-form text prompts. The task it targets, Free-Form-Text-Prompt-to-Unified-Audio generation, is meant to overcome the limitations of current methods, which rely on disjoint pipelines or structured inputs. PlanAudio relies on the LLM's intrinsic reasoning rather than a traditional text encoder, which simplifies the architecture, and it uses a semantic latent chain-of-thought mechanism for implicit planning. The approach aims to capture fine-grained interactions between speech and sound so that natural-sounding composites can be produced from unconstrained natural language. The paper is available on arXiv under ID 2605.28063.
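
To make the "plan first, then generate" idea concrete, here is a toy Python sketch of a single autoregressive loop that emits a few latent plan tokens before any audio tokens. It is only an illustration under stated assumptions, not the paper's implementation: the token kinds, the `toy_next_token` stand-in for the LLM, the fixed token counts, and the interleaving of speech and sound tokens in one stream are all hypothetical.

```python
import random
from dataclasses import dataclass

# Hypothetical token kinds, chosen for illustration; not taken from the paper.
PLAN, SPEECH, SOUND, EOS = "plan", "speech", "sound", "eos"


@dataclass(frozen=True)
class Token:
    kind: str
    value: int


def toy_next_token(prompt, history, rng, n_plan=4, n_audio=8):
    """Stand-in for one LLM decoding step (assumption: a real model would
    condition on the prompt and history; this toy only counts tokens)."""
    if len(history) < n_plan:
        # Latent planning phase: tokens that are never rendered as audio.
        return Token(PLAN, rng.randrange(100))
    if len(history) < n_plan + n_audio:
        # Audio phase: speech and sound tokens share one output stream.
        kind = SPEECH if rng.random() < 0.5 else SOUND
        return Token(kind, rng.randrange(1000))
    return Token(EOS, 0)


def generate(prompt, seed=0):
    """Run one autoregressive loop and split the result into plan and audio."""
    rng = random.Random(seed)
    history = []
    while True:
        tok = toy_next_token(prompt, history, rng)
        if tok.kind == EOS:
            break
        history.append(tok)
    plan = [t for t in history if t.kind == PLAN]
    audio = [t for t in history if t.kind != PLAN]
    return plan, audio


if __name__ == "__main__":
    plan, audio = generate("a man greets a friend while a dog barks nearby")
    print(len(plan), "plan tokens,", len(audio), "audio tokens")
```

The point of the sketch is the contrast with a disjoint pipeline: one model and one token stream carry both the planning and the speech and sound output, instead of separate speech and sound systems being stitched together afterwards.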

Key facts

  • PlanAudio is an LLM-based framework for unified audio generation.
  • It synthesizes speech and sound composites from free-form text prompts.
  • The task is called Free-Form-Text-Prompt-to-Unified-Audio generation.
  • PlanAudio uses intrinsic LLM reasoning instead of traditional text encoders.
  • It introduces a semantic latent chain-of-thought mechanism.
  • The approach simplifies the model architecture and aims to capture fine-grained interactions between speech and sound.
  • The paper is available on arXiv (ID 2605.28063).
  • Current methods rely on disjoint pipelines or structured inputs.

Entities

Institutions

  • arXiv