ARTFEED — Contemporary Art Intelligence

GaMMA: A New Large Multimodal Model for Comprehensive Music Understanding

ai-technology · 2026-05-04

Researchers have introduced GaMMA, a state-of-the-art large multimodal model (LMM) designed for comprehensive music content understanding. Built on a LLaVA-style encoder-decoder architecture, GaMMA employs mixture-of-experts audio encoders to unify time-series and non-time-series music tasks within a single set of parameters. Training follows a progressive pipeline of pretraining, supervised fine-tuning (SFT), and reinforcement learning (RL) on carefully curated large-scale datasets. To evaluate both time-series and non-time-series capabilities, the team also created MusicBench, reported as the largest music-oriented benchmark to date, with 3,739 human-curated multiple-choice questions. The paper is available on arXiv under identifier 2605.00371.
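
The mixture-of-experts encoder idea described above can be sketched in a few lines: a small router network assigns per-frame weights over several expert encoders, and their outputs are mixed into one shared representation. This is a minimal NumPy illustration under assumed dimensions, not the paper's actual architecture or code; all names and sizes here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the chosen axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

# Hypothetical sizes: 100 audio frames, 64-dim features,
# 4 expert encoders projecting into a 32-dim shared space.
T, D, E, H = 100, 64, 4, 32
frames = rng.standard_normal((T, D))
expert_weights = rng.standard_normal((E, D, H)) / np.sqrt(D)
gate_weights = rng.standard_normal((D, E)) / np.sqrt(D)

# Router: per-frame softmax distribution over the experts.
gates = softmax(frames @ gate_weights)                         # (T, E)

# Every expert encodes every frame; the gate mixes their outputs.
expert_out = np.einsum('td,edh->teh', frames, expert_weights)  # (T, E, H)
mixed = np.einsum('te,teh->th', gates, expert_out)             # (T, H)

print(mixed.shape)  # (100, 32)
```

In a real model the gate and experts would be learned jointly, and the mixed frame embeddings would be fed to the language-model decoder; the point here is only how one parameter set can route different kinds of music input through specialized encoders.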

Key facts

  • GaMMA is a state-of-the-art large multimodal model for music understanding.
  • It uses a LLaVA-style encoder-decoder design.
  • Mixture-of-experts audio encoders unify time-series and non-time-series tasks.
  • Training includes pretraining, SFT, and RL.
  • MusicBench is the largest music-oriented benchmark, with 3,739 human-curated multiple-choice questions.
  • The paper is on arXiv: 2605.00371.
  • GaMMA aims for comprehensive musical content understanding.
  • The model enables cross-modal learning between music and language.
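
Since MusicBench consists of multiple-choice questions, evaluation reduces to matching a model's predicted option letter against the gold answer. A minimal scoring sketch follows; the record layout and field names are assumptions for illustration, not the benchmark's actual schema.

```python
# Hypothetical MusicBench-style records: each question carries an id
# and a ground-truth option letter; a model returns one letter per id.
questions = [
    {"id": "q1", "answer": "B"},
    {"id": "q2", "answer": "D"},
    {"id": "q3", "answer": "A"},
]

def accuracy(preds, questions):
    """Fraction of questions where the predicted letter matches gold."""
    correct = sum(1 for q in questions if preds.get(q["id"]) == q["answer"])
    return correct / len(questions)

preds = {"q1": "B", "q2": "C", "q3": "A"}
print(accuracy(preds, questions))  # 2 of 3 correct
```

Multiple-choice scoring of this kind makes results directly comparable across models, which is presumably why the benchmark's 3,739 items were curated in that format.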

Entities

Institutions

  • arXiv

Sources