BEA-Dialogue+ Expands Hungarian ASR Training Data to 200 Hours
Researchers have released BEA-Dialogue+, an expanded version of the BEA-Dialogue corpus for conversational automatic speech recognition in Hungarian. The original corpus had a strict speaker-disjoint split that limited usable training data to 85 hours. BEA-Dialogue+ relaxes this criterion for experimenters and dialogue partners while keeping primary speakers separate, yielding 200 hours of transcribed natural conversations. This allows a controlled study of the trade-off between more training data and speaker overlap across splits. The team evaluated Whisper- and FastConformer-based models, including Serialized Output Training (SOT) fine-tuning for dialogue transcription. Results show the larger corpus is more challenging for models without fine-tuning. The work addresses the scarcity of publicly available dialogue-style training data for Hungarian ASR.
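The difference between the strict and the relaxed split criterion can be illustrated with a small sketch. It is a toy example under stated assumptions, not the authors' actual procedure: the `Conversation` class, the function names `relaxed_split` and `strict_train`, and the speaker IDs are all hypothetical. The strict variant here drops any training conversation that shares a speaker with a held-out split; the relaxed variant assigns conversations by primary speaker only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conversation:
    conv_id: str
    primary: str        # primary speaker ID (hypothetical)
    others: frozenset   # experimenter and dialogue partner IDs (hypothetical)

    def speakers(self):
        # All speaker IDs that occur in this conversation.
        return {self.primary} | self.others


def relaxed_split(convs, primary_to_split):
    """Assign each conversation to a split by its primary speaker only."""
    splits = {}
    for c in convs:
        splits.setdefault(primary_to_split[c.primary], []).append(c)
    return splits


def strict_train(convs, primary_to_split):
    """Keep a training conversation only if none of its speakers
    (primary, experimenter, or partner) occurs in a held-out split."""
    splits = relaxed_split(convs, primary_to_split)
    held_out = {s for name, cs in splits.items() if name != "train"
                for c in cs for s in c.speakers()}
    return [c for c in splits.get("train", []) if not (c.speakers() & held_out)]


# Toy data: dialogue partner D1 appears in both a train and a test conversation.
convs = [
    Conversation("c1", "P1", frozenset({"E1", "D1"})),
    Conversation("c2", "P2", frozenset({"E1", "D2"})),
    Conversation("c3", "P3", frozenset({"E2", "D1"})),
]
primary_to_split = {"P1": "train", "P2": "train", "P3": "test"}

relaxed = relaxed_split(convs, primary_to_split)
strict = strict_train(convs, primary_to_split)

print([c.conv_id for c in relaxed["train"]])  # ['c1', 'c2']
print([c.conv_id for c in strict])            # ['c2']
```

In this toy case the relaxed criterion keeps both training conversations, while the strict one discards `c1` because its dialogue partner also appears in the test split. Primary speakers stay disjoint across splits in both variants.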
Key facts
- BEA-Dialogue+ is an expanded version of the BEA-Dialogue corpus.
- The original corpus had only 85 hours of usable training data because of its strict speaker-disjoint split.
- The new version provides 200 hours of transcribed natural conversations.
- The split criterion is relaxed for experimenters and dialogue partners.
- Primary speakers remain completely separated.
- Whisper- and FastConformer-based models were evaluated.
- Serialized Output Training (SOT) fine-tuning was used for dialogue transcription.
- The larger corpus is more challenging for models without fine-tuning.
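Serialized Output Training generally builds a single target transcript for a multi-speaker recording by ordering utterances by start time and marking speaker changes with a special token. A minimal sketch of that target construction follows. The segment format, the function name `serialize_sot`, and the `<sc>` token string are assumptions made for illustration; the summary does not state which token or preprocessing the team used.

```python
def serialize_sot(segments, sc_token="<sc>"):
    """Build one SOT-style target string from time-stamped segments.

    segments: list of dicts with 'speaker', 'start' (seconds) and 'text'.
    A speaker-change token is inserted whenever the speaker differs
    from the speaker of the previous segment.
    """
    ordered = sorted(segments, key=lambda s: s["start"])
    parts = []
    prev_speaker = None
    for seg in ordered:
        if prev_speaker is not None and seg["speaker"] != prev_speaker:
            parts.append(sc_token)
        parts.append(seg["text"])
        prev_speaker = seg["speaker"]
    return " ".join(parts)


# Toy dialogue; the segments are deliberately given out of order.
segments = [
    {"speaker": "B", "start": 1.2, "text": "jó napot kívánok"},
    {"speaker": "A", "start": 0.0, "text": "jó napot"},
    {"speaker": "B", "start": 2.5, "text": "hogy van"},
]

target = serialize_sot(segments)
print(target)  # jó napot <sc> jó napot kívánok hogy van
```

Consecutive segments from the same speaker are joined without a token, so the model only has to learn where the speaker changes.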
Entities
Institutions
- arXiv