Bangla-WhisperDiar: Fine-Tuning Whisper and PyAnnote for Bangla Speech Recognition

ai-technology · 2026-05-12

Researchers have built a system for automatic speech recognition (ASR) and speaker diarization on long-form Bangla audio by fine-tuning existing models. For ASR, they fine-tuned the tugstugi/bengaliai-regional-asr-whisper-medium model on a custom dataset of about 15,000 segmented and aligned Bangla audio files, training all model weights and applying data augmentation: noise injection, reverb, echo, clipping, and pitch/time perturbation. For speaker diarization, they fine-tuned pyannote/segmentation-3.0 with PyTorch Lightning on a competition-annotated dataset, then plugged the fine-tuned segmentation backbone into the pyannote/speaker-diarization-community-1 pipeline. The work targets the main difficulties of spoken Bangla understanding: long recordings, varied acoustic conditions, and speaker variability.

Key facts

Fine-tuned tugstugi/bengaliai-regional-asr-whisper-medium for Bangla ASR
Used custom dataset of ~15,000 Bangla audio segments
Data augmentation included noise injection, reverb, echo, clipping, pitch/time perturbation
Fine-tuned pyannote/segmentation-3.0 for speaker diarization
Used PyTorch Lightning to train the segmentation model
Integrated fine-tuned segmentation into pyannote/speaker-diarization-community-1 pipeline
Addresses long-form recordings, diverse acoustic conditions, speaker variability
Published on arXiv (2605.08214)
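
The augmentations named in the key facts can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, default parameters, application probability, and the synthetic impulse response used for reverb are all assumptions made for illustration, and the speed perturbation shown here changes pitch and duration together rather than independently.

```python
import numpy as np


def add_noise(wave, snr_db, rng):
    """Inject white Gaussian noise at a target signal-to-noise ratio (dB)."""
    wave = np.asarray(wave, dtype=np.float64)
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise


def add_echo(wave, sample_rate, delay_s=0.25, decay=0.4):
    """Mix in one delayed, attenuated copy of the signal."""
    out = np.array(wave, dtype=np.float64)  # copy so the input is untouched
    delay = int(delay_s * sample_rate)
    if 0 < delay < len(out):
        out[delay:] += decay * np.asarray(wave, dtype=np.float64)[:-delay]
    return out


def add_reverb(wave, sample_rate, rng, length_s=0.3, decay_s=0.05):
    """Convolve with a synthetic impulse response (assumed: decaying noise)."""
    wave = np.asarray(wave, dtype=np.float64)
    n = int(length_s * sample_rate)
    t = np.arange(n) / sample_rate
    ir = rng.normal(size=n) * np.exp(-t / decay_s)
    ir[0] = 1.0                       # keep the direct path dominant
    ir /= np.sqrt(np.sum(ir ** 2))    # unit energy so loudness stays comparable
    return np.convolve(wave, ir)[: len(wave)]


def hard_clip(wave, threshold=0.5):
    """Simulate microphone clipping by limiting the amplitude."""
    return np.clip(np.asarray(wave, dtype=np.float64), -threshold, threshold)


def speed_perturb(wave, rate):
    """Resample by linear interpolation; rate > 1 shortens the clip and raises pitch."""
    wave = np.asarray(wave, dtype=np.float64)
    n_out = int(round(len(wave) / rate))
    new_idx = np.linspace(0, len(wave) - 1, n_out)
    return np.interp(new_idx, np.arange(len(wave)), wave)


def augment(wave, sample_rate, rng, p=0.5):
    """Apply each augmentation independently with probability p (an assumed policy)."""
    out = np.asarray(wave, dtype=np.float64)
    if rng.random() < p:
        out = speed_perturb(out, rng.uniform(0.9, 1.1))
    if rng.random() < p:
        out = add_reverb(out, sample_rate, rng)
    if rng.random() < p:
        out = add_echo(out, sample_rate)
    if rng.random() < p:
        out = add_noise(out, rng.uniform(5.0, 30.0), rng)
    if rng.random() < p:
        out = hard_clip(out)
    return out
```

Calling `augment(wave, 16000, np.random.default_rng(0))` on a mono float waveform returns a perturbed copy. Applying the transforms on the fly during training, rather than once up front, would give the model a different variant of each clip in every epoch.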
