ARTFEED — Contemporary Art Intelligence

BEA-Dialogue+ Expands Hungarian ASR Training Data to 200 Hours

publication · 2026-06-01

Researchers have released BEA-Dialogue+, an expanded version of the BEA-Dialogue corpus for conversational automatic speech recognition (ASR) in Hungarian. The original corpus used a strict speaker-disjoint split, which limited usable training data to 85 hours. BEA-Dialogue+ relaxes that criterion for experimenters and dialogue partners, who may now appear in more than one split, while keeping primary speakers fully separated. The result is 200 hours of transcribed natural conversation and a controlled setting for studying the trade-off between more training data and speaker overlap across splits. The team evaluated Whisper- and FastConformer-based models, including Serialized Output Training (SOT) fine-tuning for dialogue transcription. Results indicate that the larger corpus is more challenging for models that have not been fine-tuned. The work addresses the scarcity of publicly available dialogue-style training data for Hungarian ASR.
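The difference between the strict and relaxed split criteria can be illustrated with a short check. This is a minimal sketch, not the corpus authors' tooling: the record layout, the role labels ("primary", "experimenter", "partner"), and the function names are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical records: (split, session_id, speaker_id, role).
# The field layout and role names are assumptions, not the corpus format.
records = [
    ("train", "s1", "spk01", "primary"),
    ("train", "s1", "exp01", "experimenter"),
    ("train", "s1", "par01", "partner"),
    ("test", "s2", "spk02", "primary"),
    ("test", "s2", "exp01", "experimenter"),
    ("test", "s2", "par01", "partner"),
]


def overlap_by_role(records):
    """Return, per role, the speaker IDs that occur in more than one split."""
    splits_seen = defaultdict(set)  # (role, speaker_id) -> splits it occurs in
    for split, _session, speaker, role in records:
        splits_seen[(role, speaker)].add(split)

    leaked = defaultdict(set)
    for (role, speaker), splits in splits_seen.items():
        if len(splits) > 1:
            leaked[role].add(speaker)
    return dict(leaked)


def is_valid_strict_split(records):
    """Strict criterion: no speaker of any role is shared between splits."""
    return not overlap_by_role(records)


def is_valid_relaxed_split(records, protected_role="primary"):
    """Relaxed criterion: only speakers in the protected role must be disjoint."""
    return not overlap_by_role(records).get(protected_role)


if __name__ == "__main__":
    print(overlap_by_role(records))
    print("strict:", is_valid_strict_split(records))
    print("relaxed:", is_valid_relaxed_split(records))
```

In this toy data the experimenter and dialogue partner recur in both splits, so the strict check fails while the relaxed check passes, which is the situation that lets more sessions be kept for training.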

Key facts

  • BEA-Dialogue+ is an expanded version of the BEA-Dialogue corpus.
  • Original corpus's strict speaker-disjoint split limited usable training data to 85 hours.
  • New version provides 200 hours of transcribed natural conversations.
  • Split criterion relaxed for experimenters and dialogue partners.
  • Primary speakers remain completely separated.
  • Whisper- and FastConformer-based models were evaluated.
  • Serialized Output Training (SOT) fine-tuning was used for dialogue transcription.
  • Larger corpus is more challenging for models without fine-tuning.
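
For readers unfamiliar with Serialized Output Training, the sketch below shows one common way a multi-speaker dialogue is flattened into a single target string: utterances are ordered by start time and a speaker-change token is inserted wherever the speaker differs from the previous one. The token name `<sc>`, the tuple layout, and the function name are assumptions; the summary does not state which serialization the team used.

```python
SPEAKER_CHANGE = "<sc>"  # assumed token name; the actual token may differ


def serialize_sot(utterances, sc_token=SPEAKER_CHANGE):
    """Flatten (start_time, speaker, text) tuples into one SOT-style target.

    Utterances are sorted by start time, and the speaker-change token is
    placed between consecutive utterances from different speakers.
    """
    ordered = sorted(utterances, key=lambda u: u[0])
    parts = []
    prev_speaker = None
    for _start, speaker, text in ordered:
        if prev_speaker is not None and speaker != prev_speaker:
            parts.append(sc_token)
        parts.append(text)
        prev_speaker = speaker
    return " ".join(parts)


if __name__ == "__main__":
    dialogue = [
        (1.2, "B", "igen"),
        (0.0, "A", "szia"),
        (2.0, "B", "persze"),
    ]
    print(serialize_sot(dialogue))
```

A model fine-tuned on targets of this shape learns to emit both the words and the points where the speaker changes in a single output sequence.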

Entities

Institutions

  • arXiv
