GaMMA: A New Large Multimodal Model for Comprehensive Music Understanding
Researchers have introduced GaMMA, a state-of-the-art large multimodal model (LMM) for comprehensive music content understanding. Built on the LLaVA encoder-decoder architecture, GaMMA uses mixture-of-experts audio encoders to unify time-series and non-time-series music tasks within a single set of parameters. The model is trained with a progressive pipeline of pretraining, supervised fine-tuning (SFT), and reinforcement learning (RL) on large-scale, carefully curated datasets. To evaluate both temporal and non-temporal capabilities, the team also built MusicBench, the largest music-oriented benchmark to date, comprising 3,739 human-curated multiple-choice questions. The paper is available on arXiv under identifier 2605.00371.
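The summary does not detail how GaMMA's mixture-of-experts encoders are wired, so the following is a minimal, hypothetical PyTorch sketch of the general mechanism only: a learned router weighs a few specialist sub-encoders per audio frame, letting one parameter set serve both temporal (frame-level) and global tasks. The class name `MoEAudioEncoder` and every dimension below are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEAudioEncoder(nn.Module):
    """Toy mixture-of-experts audio encoder (hypothetical): a router
    softly combines the top-k expert embeddings of each audio frame."""
    def __init__(self, feat_dim=128, hidden_dim=256, num_experts=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.GELU(),
                          nn.Linear(hidden_dim, hidden_dim))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(feat_dim, num_experts)
        self.top_k = top_k

    def forward(self, x):
        # x: (batch, time, feat_dim) audio features, e.g. mel-spectrogram frames
        logits = self.router(x)                          # (B, T, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)   # route each frame to top-k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros(*x.shape[:2], self.experts[0][-1].out_features,
                          device=x.device)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e)                # frames routed to expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out  # per-frame embeddings, to be projected into the LLM decoder

enc = MoEAudioEncoder()
mel = torch.randn(2, 100, 128)   # 2 clips x 100 frames x 128 mel bins
print(enc(mel).shape)            # torch.Size([2, 100, 256])
```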
Key facts
- GaMMA is a state-of-the-art large multimodal model aimed at comprehensive music content understanding.
- It builds on the LLaVA encoder-decoder design.
- Mixture-of-experts audio encoders unify time-series and non-time-series tasks within a single parameter set.
- Training follows a progressive pipeline: pretraining, SFT, then RL.
- MusicBench is the largest music-oriented benchmark, with 3,739 human-curated multiple-choice questions (see the scoring sketch after this list).
- The paper is on arXiv: 2605.00371.
- The model enables cross-modal learning between music and language.
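MusicBench's exact protocol isn't specified here beyond "human-curated multiple-choice questions," so as a rough illustration of how such a benchmark is typically scored, here is a hypothetical Python sketch. The `MCQItem` type and the `predict` callback (standing in for a model call) are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class MCQItem:
    question: str
    choices: list[str]   # e.g. ["3/4", "4/4", "6/8"]
    answer: int          # index of the correct choice

def accuracy(items: list[MCQItem], predict) -> float:
    """Fraction of questions answered correctly.
    `predict(question, choices) -> int` stands in for querying the model."""
    correct = sum(predict(it.question, it.choices) == it.answer for it in items)
    return correct / len(items)

# Toy usage with a trivial "model" that always picks the first option.
items = [MCQItem("What meter is this clip in?", ["3/4", "4/4", "6/8"], 1)]
print(accuracy(items, lambda q, c: 0))  # 0.0
```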