ARTFEED — Contemporary Art Intelligence

Semantic Fragility in Text-to-Audio Systems Under Prompt Perturbations

ai-technology · 2026-05-07

A recent study examines how robust text-to-audio generation models are to prompts that preserve meaning but differ in wording. The researchers tested three systems: MusicGen-small, MusicGen-large, and Stable Audio 2.5. They built a dataset of 75 prompt groups in which the wording is varied through three perturbation types: Minimal Lexical Substitution, Intensity Shifts, and Structural Rephrasing. Comparing the generated audio with spectral, temporal, and semantic similarity measures, they found that small linguistic changes can cause substantial variation in the output. The findings raise reliability concerns for practical use and point to the need for models that handle nuanced language variation more consistently.

Key facts

  • Study evaluates semantic fragility in text-to-audio generation systems.
  • Models tested: MusicGen-small, MusicGen-large, Stable Audio 2.5.
  • Three perturbation types: Minimal Lexical Substitution, Intensity Shifts, Structural Rephrasing.
  • Dataset includes 75 prompt groups with preserved semantic intent.
  • Outputs compared via spectral, temporal, and semantic similarity measures.
  • Small linguistic changes can cause substantial variation in generated audio.
  • Research highlights reliability concerns for practical use.
  • Published on arXiv with identifier 2603.13824.
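
To make the idea of comparing outputs within a prompt group concrete, here is a minimal sketch of one possible spectral consistency score. It is not the study's method: the summary does not specify the metrics used, so the naive DFT, the cosine similarity, and all function names (`magnitude_spectrum`, `group_spectral_consistency`, `tone`) are assumptions chosen for illustration. Synthetic sine tones stand in for generated audio.

```python
import cmath
import math
from itertools import combinations


def magnitude_spectrum(signal):
    """Naive DFT magnitudes for bins 0..n/2 (O(n^2), fine for short clips)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n // 2 + 1)
    ]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_spectral_consistency(clips):
    """Mean pairwise cosine similarity of magnitude spectra in one prompt group.

    Hypothetical score: near 1 means the clips sound spectrally alike,
    near 0 means they differ. Clips are assumed to have equal length.
    """
    if len(clips) < 2:
        raise ValueError("need at least two clips to compare")
    spectra = [magnitude_spectrum(c) for c in clips]
    pairs = list(combinations(spectra, 2))
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


def tone(freq_bin, n=64):
    """Synthetic stand-in for a generated clip: a sine at an exact DFT bin."""
    return [math.sin(2 * math.pi * freq_bin * t / n) for t in range(n)]


# A group whose paraphrased prompts all yielded the same tone,
# and a group whose paraphrases yielded three different tones.
stable_group = [tone(4), tone(4), tone(4)]
fragile_group = [tone(4), tone(9), tone(15)]

print(group_spectral_consistency(stable_group))   # close to 1
print(group_spectral_consistency(fragile_group))  # close to 0
```

A real evaluation would likely use mel spectrograms or learned audio embeddings rather than a raw DFT, and would add separate temporal and semantic measures alongside the spectral one.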

Entities

Models

  • MusicGen
  • Stable Audio

Sources