MSU-Bench: New Benchmark Tests LLMs on Musical Score Understanding
Researchers have introduced the Musical Score Understanding Benchmark (MSU-Bench), a human-curated dataset designed to evaluate how well Large Language Models (LLMs) and Vision-Language Models (VLMs) comprehend complete musical scores. The benchmark includes 1,800 generative question-answer pairs drawn from works by composers such as Bach, Beethoven, Chopin, and Debussy, covering both textual (ABC notation) and visual (PDF) modalities. Questions are organized into four difficulty levels, from basic onset information to texture and form. Evaluations of over fifteen state-of-the-art models in zero-shot and fine-tuned settings revealed significant modality gaps, unstable performance across difficulty levels, and challenges in maintaining multilevel correctness. Fine-tuning substantially improved results, but overall the study highlights that current models still struggle with integrated musical reasoning. The benchmark aims to advance research in music AI and multimodal understanding.
Key facts
- MSU-Bench is a human-curated benchmark for score-level musical understanding.
- It contains 1,800 generative question-answer pairs.
- Works by Bach, Beethoven, Chopin, Debussy, and others are included.
- Questions span four difficulty levels: onset, rhythm, texture, form.
- Over fifteen state-of-the-art LLMs and VLMs were evaluated in zero-shot and fine-tuned settings.
- Modality gaps between text and visual inputs were observed.
- Fine-tuning improved results but challenges remain.
- The study was published on arXiv (2511.20697).
Entities
Composers
- Bach
- Beethoven
- Chopin
- Debussy
Institutions
- arXiv