GaMMA: A New Large Multimodal Model for Comprehensive Music Understanding
Researchers have introduced GaMMA, a state-of-the-art large multimodal model (LMM) for comprehensive music content understanding. Built on the LLaVA encoder-decoder architecture, GaMMA uses mixture-of-experts audio encoders to unify time-series and non-time-series music tasks within a single set of parameters. The model is trained with a progressive pipeline of pretraining, supervised fine-tuning (SFT), and reinforcement learning (RL) on large-scale, carefully curated datasets. To evaluate both temporal and non-temporal capabilities, the team also built MusicBench, the largest music-oriented benchmark to date, comprising 3,739 human-curated multiple-choice questions. The paper is available on arXiv under identifier 2605.00371.
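The summary does not detail how GaMMA's mixture-of-experts encoders are wired, so the following is a minimal, hypothetical PyTorch sketch of the general mechanism only: a learned router weighs a few specialist sub-encoders per audio frame, letting one parameter set serve both temporal (frame-level) and global tasks. The class name `MoEAudioEncoder` and every dimension below are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEAudioEncoder(nn.Module):
    """Toy mixture-of-experts audio encoder (hypothetical): a router
    softly combines the top-k expert embeddings of each audio frame."""
    def __init__(self, feat_dim=128, hidden_dim=256, num_experts=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.GELU(),
                          nn.Linear(hidden_dim, hidden_dim))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(feat_dim, num_experts)
        self.top_k = top_k

    def forward(self, x):
        # x: (batch, time, feat_dim) audio features, e.g. mel-spectrogram frames
        logits = self.router(x)                          # (B, T, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)   # route each frame to top-k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros(*x.shape[:2], self.experts[0][-1].out_features,
                          device=x.device)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e)                # frames routed to expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out  # per-frame embeddings, to be projected into the LLM decoder

enc = MoEAudioEncoder()
mel = torch.randn(2, 100, 128)   # 2 clips x 100 frames x 128 mel bins
print(enc(mel).shape)            # torch.Size([2, 100, 256])
```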
Key facts
- GaMMA is a state-of-the-art large multimodal model aimed at comprehensive music content understanding.
- It builds on the LLaVA encoder-decoder design.
- Mixture-of-experts audio encoders unify time-series and non-time-series tasks within a single parameter set.
- Training follows a progressive pipeline: pretraining, SFT, then RL.
- MusicBench is the largest music-oriented benchmark, with 3,739 human-curated multiple-choice questions (see the scoring sketch after this list).
- The paper is on arXiv: 2605.00371.
- The model enables cross-modal learning between music and language.
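MusicBench's exact protocol isn't specified here beyond "human-curated multiple-choice questions," so as a rough illustration of how such a benchmark is typically scored, here is a hypothetical Python sketch. The `MCQItem` type and the `predict` callback (standing in for a model call) are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class MCQItem:
    question: str
    choices: list[str]   # e.g. ["3/4", "4/4", "6/8"]
    answer: int          # index of the correct choice

def accuracy(items: list[MCQItem], predict) -> float:
    """Fraction of questions answered correctly.
    `predict(question, choices) -> int` stands in for querying the model."""
    correct = sum(predict(it.question, it.choices) == it.answer for it in items)
    return correct / len(items)

# Toy usage with a trivial "model" that always picks the first option.
items = [MCQItem("What meter is this clip in?", ["3/4", "4/4", "6/8"], 1)]
print(accuracy(items, lambda q, c: 0))  # 0.0
```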