Discrete Audio and Speech Benchmark (DASB) Introduced to Evaluate Audio Token Performance
A new framework, the Discrete Audio and Speech Benchmark (DASB), has been introduced to address inconsistent evaluation practices in research on discrete audio tokens. These tokens are a promising way to connect audio processing with language models, allowing multimodal systems to understand and generate audio. Preserving essential information in them, such as phonetic content, speaker identity, and paralinguistic cues, remains difficult, however. The benchmark evaluates tokens across speech, general audio, and music, using both discriminative and generative tasks. The results show that discrete representations are generally less robust than continuous ones and require careful tuning of model architecture, data size, learning rate, and capacity. Semantic tokens typically outperform other token types. The paper is available on arXiv as 2406.14294v3 (announcement type: replace-cross).
Key facts
- Discrete Audio and Speech Benchmark (DASB) introduced as a comprehensive evaluation framework
- Discrete audio tokens bridge audio and language processing for multimodal models
- Preserving phonetic content, speaker identity, and paralinguistic cues remains challenging
- Benchmark addresses inconsistent evaluation settings across existing studies
- Evaluates tokens across speech, general audio, and music domains
- Tests both discriminative and generative tasks
- Discrete representations found less robust than continuous ones
- Discrete tokens require careful tuning of model architecture, data size, learning rate, and capacity
Entities
Institutions
- arXiv