BEA-Dialogue+ Expands Hungarian ASR Training Data to 200 Hours

publication · 2026-06-01

Researchers have released BEA-Dialogue+, an expanded version of the BEA-Dialogue corpus for conversational automatic speech recognition (ASR) in Hungarian. The original corpus used a strict speaker-disjoint split, which limited usable training data to 85 hours. BEA-Dialogue+ relaxes this criterion for experimenters and dialogue partners while keeping primary speakers separate across splits, yielding 200 hours of transcribed natural conversations. This allows a controlled study of the trade-off between more training data and speaker overlap across splits. The team evaluated Whisper- and FastConformer-based models, including Serialized Output Training (SOT) fine-tuning for dialogue transcription. The results show that the larger corpus is more challenging for models without fine-tuning. The work addresses the scarcity of publicly available dialogue-style training data for Hungarian ASR.
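The difference between the strict and the relaxed split criterion can be illustrated with a small check. This is a minimal sketch, not the corpus authors' tooling: the `Recording` fields, the speaker IDs, and the reading that the relaxed rule compares only primary speakers across splits are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Recording:
    rec_id: str
    primary: str                      # primary speaker ID (hypothetical field)
    others: frozenset = frozenset()   # experimenter / dialogue-partner IDs


def speakers(recordings, strict):
    """Collect the speaker IDs that must not be shared across splits."""
    ids = set()
    for rec in recordings:
        ids.add(rec.primary)
        if strict:
            # Strict criterion: every participant counts, not just the primary.
            ids |= rec.others
    return ids


def split_violations(splits, strict=False):
    """Return {(split_a, split_b): shared_ids} for every overlapping pair."""
    violations = {}
    for a, b in combinations(sorted(splits), 2):
        shared = speakers(splits[a], strict) & speakers(splits[b], strict)
        if shared:
            violations[(a, b)] = shared
    return violations


# Toy data: the same experimenter appears in both splits.
splits = {
    "train": [Recording("r1", "spk01", frozenset({"exp1", "partner1"}))],
    "test": [Recording("r2", "spk02", frozenset({"exp1", "partner2"}))],
}
relaxed = split_violations(splits, strict=False)
strict = split_violations(splits, strict=True)
```

In this toy example the relaxed check passes because the primary speakers differ, while the strict check flags the shared experimenter. Recordings rejected only by the strict rule are the kind of data a relaxed criterion would let back into training.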

Key facts

BEA-Dialogue+ is an expanded version of the BEA-Dialogue corpus.
Original corpus offered 85 hours of usable training data because of its strict speaker-disjoint split.
New version provides 200 hours of transcribed natural conversations.
Split criterion relaxed for experimenters and dialogue partners.
Primary speakers remain completely separated.
Whisper- and FastConformer-based models were evaluated.
Serialized Output Training (SOT) fine-tuning was used for dialogue transcription.
Larger corpus is more challenging for models without fine-tuning.
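
To make the SOT idea concrete: in Serialized Output Training, the transcripts of several speakers are joined into one target sequence, ordered by start time, with a speaker-change token between them. The sketch below assumes a `<sc>` token and a `(start, speaker, text)` segment layout; both are illustrative choices, and the target format actually used for BEA-Dialogue+ is not described here.

```python
def serialize_sot(segments, sc_token="<sc>"):
    """Build one SOT-style target string from (start, speaker, text) segments.

    Segments are ordered by start time, and sc_token is inserted whenever
    the speaker changes between consecutive segments.
    """
    ordered = sorted(segments, key=lambda seg: seg[0])
    parts = []
    prev_speaker = None
    for _start, speaker, text in ordered:
        if prev_speaker is not None and speaker != prev_speaker:
            parts.append(sc_token)
        parts.append(text)
        prev_speaker = speaker
    return " ".join(parts)


# Hypothetical two-speaker exchange, given out of order on purpose.
segments = [
    (2.1, "B", "igen"),
    (0.0, "A", "jó napot"),
    (3.0, "B", "persze"),
]
target = serialize_sot(segments)
```

A model fine-tuned on targets of this shape learns to emit the speaker-change token itself, so a single decoder output covers both sides of a conversation.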
