Bangla-WhisperDiar: Fine-Tuning Whisper and PyAnnote for Bangla Speech Recognition
A team of researchers has built a system for automatic speech recognition (ASR) and speaker diarization in long-form Bangla audio by fine-tuning existing pretrained models. For ASR, they fine-tuned the tugstugi/bengaliai-regional-asr-whisper-medium model on a custom dataset of around 15,000 segmented and aligned Bangla audio files, using full-weight training (updating all model parameters) together with data augmentation: noise injection, reverb, echo, clipping, and pitch/time perturbation. For speaker diarization, they fine-tuned pyannote/segmentation-3.0 with PyTorch Lightning on a competition-annotated dataset, then plugged the fine-tuned segmentation backbone into the pyannote/speaker-diarization-community-1 pipeline. The work targets three difficulties in processing spoken Bangla: long recordings, varied acoustic environments, and speaker variability.
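The augmentation types named above can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function names, parameter values (SNR range, echo delay, decay time, clipping threshold, speed range), and the 0.5 application probability are all assumptions chosen for illustration. The speed change shown here shifts pitch and tempo together, which is a simpler stand-in for independent pitch/time perturbation.

```python
import numpy as np

def add_noise(wave, snr_db, rng):
    """Add white Gaussian noise at a target signal-to-noise ratio (dB)."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return wave + rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)

def add_echo(wave, sample_rate, delay_s=0.25, decay=0.4):
    """Mix in one delayed, attenuated copy of the signal."""
    delay = int(delay_s * sample_rate)
    out = wave.copy()
    if 0 < delay < len(wave):
        out[delay:] += decay * wave[:-delay]
    return out

def add_reverb(wave, sample_rate, rng, rt60_s=0.3):
    """Convolve with a synthetic impulse response (exponentially decaying noise)."""
    n = int(rt60_s * sample_rate)
    t = np.arange(n) / sample_rate
    ir = rng.normal(size=n) * np.exp(-6.9 * t / rt60_s)  # about 60 dB decay over rt60_s
    ir[0] = 1.0  # direct path
    wet = np.convolve(wave, ir)[: len(wave)]
    # Rescale so the peak level matches the input.
    return wet / (np.max(np.abs(wet)) + 1e-9) * np.max(np.abs(wave))

def clip(wave, threshold):
    """Hard-clip samples to [-threshold, threshold] to mimic overload distortion."""
    return np.clip(wave, -threshold, threshold)

def change_speed(wave, rate):
    """Resample by linear interpolation; rate > 1 shortens the clip.
    Pitch and tempo change together (assumption: a simplified perturbation)."""
    n_out = int(round(len(wave) / rate))
    positions = np.linspace(0, len(wave) - 1, n_out)
    return np.interp(positions, np.arange(len(wave)), wave)

def augment(wave, sample_rate, rng, p=0.5):
    """Apply each effect independently with probability p (hypothetical policy)."""
    out = wave.astype(np.float64)
    if rng.random() < p:
        out = change_speed(out, rng.uniform(0.9, 1.1))
    if rng.random() < p:
        out = add_reverb(out, sample_rate, rng)
    if rng.random() < p:
        out = add_echo(out, sample_rate)
    if rng.random() < p:
        out = add_noise(out, rng.uniform(5, 30), rng)
    if rng.random() < p:
        out = clip(out, 0.5 * np.max(np.abs(out)))
    return out

# Usage: augment one second of a synthetic 220 Hz tone at 16 kHz,
# the sample rate Whisper models expect.
rng = np.random.default_rng(0)
sample_rate = 16000
tone = 0.8 * np.sin(2 * np.pi * 220 * np.arange(sample_rate) / sample_rate)
augmented = augment(tone, sample_rate, rng)
```

Applying the effects on the fly, with fresh random parameters each epoch, lets a dataset of this size expose the model to far more acoustic conditions than it contains recordings; whether the authors augmented on the fly or offline is not stated in the summary.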
Key facts
- Fine-tuned tugstugi/bengaliai-regional-asr-whisper-medium for Bangla ASR
- Used custom dataset of ~15,000 Bangla audio segments
- Data augmentation included noise injection, reverb, echo, clipping, pitch/time perturbation
- Fine-tuned pyannote/segmentation-3.0 for speaker diarization
- Used PyTorch Lightning for training
- Integrated fine-tuned segmentation into pyannote/speaker-diarization-community-1 pipeline
- Addresses long-form recordings, diverse acoustic conditions, speaker variability
- Published on arXiv (2605.08214)
Entities
Institutions
- arXiv