New Benchmark TMD-Bench Evaluates Music-Dance Co-Generation
Researchers have developed TMD-Bench, a benchmark for assessing text-driven music-dance co-generation systems. It evaluates unimodal generation quality, instruction adherence, and cross-modal rhythmic alignment. The benchmark combines computable physical metrics with multimodal perceptual judgments, building on a curated dataset of rhythm-aligned music and dance together with a fine-grained Music Captioner that supplies structured musical semantics. It targets a gap that unimodal metrics and generic audiovisual consistency scores overlook: musical rhythm, phrasing, and accents must drive choreographic movement at a fine temporal granularity. The work is documented in a paper on arXiv (2605.01809) and aims to advance unified audio-visual generation for virtual production and interactive media applications.
Key facts
- TMD-Bench is a benchmark for text-driven music-dance co-generation.
- It evaluates unimodal generation quality, instruction adherence, and cross-modal rhythmic alignment.
- The benchmark integrates computable physical metrics with perceptual multimodal judgments.
- It includes a curated rhythm-aligned music-dance dataset.
- A fine-grained Music Captioner provides structured music semantics.
- The task requires musical rhythm, phrasing, and accents to drive choreographic motion.
- Current unimodal metrics and generic audiovisual scores fail to capture rhythmic coupling.
- The research is published on arXiv with ID 2605.01809.
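The paper's exact rhythmic-alignment metric is not specified here; as an illustration only, a common "beat alignment" heuristic from the music-dance literature scores how closely kinematic motion beats (local minima of joint velocity, where dancers pause or reverse direction) land on the musical beats. The function names, the Gaussian tolerance `sigma`, and the velocity-minima heuristic below are all assumptions, not TMD-Bench's actual definition:

```python
import numpy as np

def beat_alignment_score(music_beats, motion_beats, sigma=0.1):
    """Hypothetical beat-alignment score: each music beat (in seconds) is
    rewarded by a Gaussian of width sigma centered on its nearest motion
    beat. Scores near 1.0 mean the dance lands tightly on the music."""
    music_beats = np.asarray(music_beats, dtype=float)
    motion_beats = np.asarray(motion_beats, dtype=float)
    if music_beats.size == 0 or motion_beats.size == 0:
        return 0.0
    # Distance from each music beat to its closest motion beat.
    dists = np.abs(music_beats[:, None] - motion_beats[None, :]).min(axis=1)
    return float(np.mean(np.exp(-dists**2 / (2.0 * sigma**2))))

def motion_beats_from_velocity(joint_vel_norm, times):
    """Extract 'motion beats' as strict local minima of the joint-velocity
    magnitude curve, a common heuristic for kinematic beat detection."""
    v = np.asarray(joint_vel_norm, dtype=float)
    minima = (v[1:-1] < v[:-2]) & (v[1:-1] < v[2:])
    return np.asarray(times, dtype=float)[1:-1][minima]
```

In this sketch, perfectly coincident beat lists score 1.0, and the score decays smoothly as the dance drifts off the beat; `sigma` sets how much temporal slack counts as "on the beat".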