ARTFEED — Contemporary Art Intelligence

BandTok: A 2D Mel-Spectrogram Tokenizer for Music Generation

ai-technology · 2026-05-18

Researchers have introduced BandTok, a generation-oriented 2D Mel-spectrogram tokenizer for music generation. Whereas current high-fidelity codecs rely on residual multi-codebook quantization, BandTok represents each frame as a set of Mel-frequency band tokens drawn from a single shared codebook. The result is a physically interpretable time-frequency token grid whose tokens are more independent of one another, which minimizes sequential dependencies and reduces error accumulation. To improve reconstruction, BandTok is trained with a multi-scale PatchGAN objective and EMA (exponential moving average) codebook updates. The authors also present an autoregressive language model that uses 2D Rotary Position Embedding (2D RoPE) to preserve the time and frequency-band structure during generation. The full paper is available on arXiv.
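To make the idea of a shared-codebook token grid concrete, here is a minimal sketch and not the authors' implementation. It assumes each Mel band of each frame has already been encoded into a small feature vector, and it uses plain nearest-neighbor lookup with squared Euclidean distance; the function names, the toy codebook, and the feature values are all hypothetical.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def tokenize(band_features, codebook):
    """Map features shaped [frames][bands][dim] to a token grid [frames][bands].

    Every band of every frame is quantized against the same codebook, so each
    token is chosen independently of the others (there is no residual chain).
    """
    return [[nearest_code(band, codebook) for band in frame] for frame in band_features]


# Hypothetical toy setup: 3 codes of dimension 2, 2 frames, 3 bands per frame.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
band_features = [
    [[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7]],
    [[1.1, 0.8], [0.0, 0.2], [-1.2, 1.1]],
]
token_grid = tokenize(band_features, codebook)
print(token_grid)  # rows are frames (time), columns are Mel bands (frequency)
```

In a residual multi-codebook scheme, each later code quantizes the error left by the earlier ones, so a wrong early token affects everything after it. In the grid above, by contrast, each cell depends only on its own band feature.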

Key facts

  • BandTok is a 2D Mel-spectrogram tokenizer for music generation.
  • It uses a single shared codebook for Mel-frequency band tokens.
  • The tokenizer creates a time-frequency token grid.
  • It employs multi-scale PatchGAN and EMA codebook updates.
  • The language model uses 2D Rotary Position Embedding (2D RoPE).
  • The paper is on arXiv with ID 2605.15831.
  • It addresses error accumulation in residual multi-codebook quantization.
  • The approach is generation-oriented and designed for autoregressive modeling.
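
The 2D RoPE idea mentioned above can be illustrated with a small sketch. This is one common "axial" formulation, assumed here for illustration and not confirmed as the paper's exact design: the first half of a query or key vector is rotated by the frame (time) index and the second half by the Mel-band (frequency) index. The function names, the base of 10000, and the sample vectors are hypothetical.

```python
import math


def rotate_pairs(vector, position, base=10000.0):
    """Apply standard 1D RoPE: rotate consecutive pairs by position-dependent angles."""
    dim = len(vector)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vector[i], vector[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out


def rope_2d(vector, time_index, band_index):
    """Axial 2D RoPE: first half encodes the frame index, second half the band index.

    Assumes len(vector) is divisible by 4 so that each half splits into pairs.
    """
    half = len(vector) // 2
    return rotate_pairs(vector[:half], time_index) + rotate_pairs(vector[half:], band_index)


def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))


q = [0.5, -1.0, 0.25, 2.0, 1.0, 0.0, -0.5, 0.75]
k = [1.0, 0.5, -0.75, 0.1, 0.2, 0.3, 0.4, -0.6]

# Same relative offset (2 frames, 1 band) at two different absolute positions.
score_a = dot(rope_2d(q, 5, 3), rope_2d(k, 3, 2))
score_b = dot(rope_2d(q, 12, 7), rope_2d(k, 10, 6))
print(abs(score_a - score_b) < 1e-9)
```

Because each half is a pure rotation, the score between two tokens depends only on how far apart they are in time and in frequency band, not on their absolute grid positions. That is the sense in which such an embedding keeps both axes of the token grid visible to the language model.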

Entities

Institutions

  • arXiv

Sources