Discrete Audio and Speech Benchmark (DASB) Introduced to Evaluate Audio Token Performance

ai-technology · 2026-04-20

The Discrete Audio and Speech Benchmark (DASB) is a new framework that addresses inconsistent evaluation practices in research on discrete audio tokens. These tokens are a promising way to connect audio processing with language models, enabling multimodal systems to understand and generate audio content. The difficulty lies in preserving vital information such as phonetic content, speaker identity, and paralinguistic cues. The benchmark evaluates tokens across speech, general audio, and music, using both discriminative and generative tasks. Results show that discrete representations are generally less robust than their continuous counterparts and require careful tuning of model architecture, data size, learning rate, and capacity. Semantic tokens typically outperform other token types. The paper is available on arXiv as 2406.14294v3, under the replace-cross announcement type.

Key facts

Discrete Audio and Speech Benchmark (DASB) introduced as a comprehensive evaluation framework
Discrete audio tokens bridge audio and language processing for multimodal models
Preserving phonetic content, speaker identity, and paralinguistic cues remains challenging
Benchmark addresses inconsistent evaluation settings across existing studies
Evaluates tokens across speech, general audio, and music domains
Tests both discriminative and generative tasks
Discrete representations found less robust than continuous ones
Discrete tokens require careful tuning of model architecture, data size, learning rate, and capacity
