ARTFEED — Contemporary Art Intelligence

Khala: A Two-Stage Framework for High-Fidelity Music Generation Using Acoustic Token Language Models

ai-technology · 2026-05-06

A research paper on arXiv (2605.01790) introduces Khala, a framework for music generation that models both structure and fidelity within a single deep acoustic-token hierarchy. The system uses a 64-layer residual vector quantization (RVQ) acoustic representation and a two-stage coarse-to-fine generation process: a backbone model first generates coarse acoustic tokens for the full track, and a super-resolution model then refines the finer token layers one at a time, each layer predicted in parallel across time, yielding a fixed 62-step inference process. The approach aims to improve lyric alignment and fine-detail reconstruction, challenging the common design pattern of keeping separate representation spaces for structure and fidelity.
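The two-stage flow described above can be sketched in a few lines of Python. This is an illustrative mock-up, not the paper's implementation: the model calls are random stand-ins, and the choice of 2 coarse backbone layers (which makes 64 total layers minus 2 coarse layers equal the stated 62 refinement steps) and a codebook size of 1024 are assumptions for illustration only.

```python
import random

NUM_LAYERS = 64       # depth of the RVQ hierarchy (stated in the article)
COARSE_LAYERS = 2     # assumption: layers emitted by the backbone (64 - 2 = 62 steps)
CODEBOOK_SIZE = 1024  # assumption: codebook entries per RVQ layer

def backbone_generate(num_frames):
    """Stand-in for stage 1: the backbone generates coarse tokens for the full track."""
    return [[random.randrange(CODEBOOK_SIZE) for _ in range(num_frames)]
            for _ in range(COARSE_LAYERS)]

def super_resolution_step(coarser_tokens, num_frames):
    """Stand-in for stage 2: predicts one finer RVQ layer for every time
    frame at once (parallel over time), conditioned on all coarser layers."""
    return [random.randrange(CODEBOOK_SIZE) for _ in range(num_frames)]

def generate(num_frames):
    tokens = backbone_generate(num_frames)          # stage 1: coarse structure
    steps = 0
    for layer in range(COARSE_LAYERS, NUM_LAYERS):  # stage 2: layer-by-layer refinement
        tokens.append(super_resolution_step(tokens, num_frames))
        steps += 1
    return tokens, steps

tokens, steps = generate(num_frames=100)
print(len(tokens), steps)  # 64 layers filled, 62 refinement steps
```

The key design point the sketch captures is that only the coarse stage works sequentially over the track, while each of the 62 refinement steps fills an entire layer in parallel across time, so inference cost is fixed by the hierarchy depth rather than the track length.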

Key facts

  • Khala uses a 64-layer RVQ acoustic representation.
  • The framework has a two-stage coarse-to-fine generation process.
  • A backbone model generates coarse acoustic tokens for the full track.
  • A super-resolution model refines finer tokens within the same acoustic token space.
  • The super-resolution stage works at full-track scale and runs in parallel over time.
  • The inference process is fixed at 62 steps.
  • The paper is available on arXiv with ID 2605.01790.
  • The approach aims to improve lyric alignment and fine-detail reconstruction.

Entities

Institutions

  • arXiv

Sources