ARTFEED — Contemporary Art Intelligence

Bangla-WhisperDiar: Fine-Tuning Whisper and PyAnnote for Bangla Speech Recognition

ai-technology · 2026-05-12

A team of researchers has built a system for automatic speech recognition (ASR) and speaker diarization in long-form Bangla audio by fine-tuning existing open models. For ASR, they fine-tuned the tugstugi/bengaliai-regional-asr-whisper-medium model on a custom dataset of around 15,000 segmented and aligned Bangla audio files, using full-weight training (all model parameters updated) and data augmentation: noise injection, reverb, echo, clipping, and pitch/time perturbation. For speaker diarization, they fine-tuned pyannote/segmentation-3.0 with PyTorch Lightning on a competition-annotated dataset, then integrated the fine-tuned segmentation backbone into the pyannote/speaker-diarization-community-1 pipeline. The work targets three difficulties in Bangla spoken language understanding: long recordings, varied acoustic conditions, and speaker variability.
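To make the augmentation step concrete, here is a minimal NumPy sketch of the kinds of waveform transforms named above. It is an illustration only, not the authors' code: the function names, probabilities, and parameter ranges are all assumptions, the reverb uses a synthetic decaying-noise impulse response rather than a measured room response, and the pitch/time step is approximated by simple speed perturbation, which changes tempo and pitch together.

```python
import numpy as np

def add_noise(wave, snr_db, rng):
    """Inject white Gaussian noise at a target signal-to-noise ratio (dB)."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return wave + rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)

def add_echo(wave, sample_rate, delay_s=0.25, decay=0.4):
    """Mix in one delayed, attenuated copy of the signal."""
    delay = int(delay_s * sample_rate)
    out = wave.copy()
    if 0 < delay < len(wave):
        out[delay:] += decay * wave[:-delay]
    return out

def add_reverb(wave, sample_rate, rng, rt60_s=0.3):
    """Convolve with a synthetic impulse response (assumed, not measured).

    The envelope exp(-6.9 * t / rt60) falls by about 60 dB at t = rt60.
    """
    n = max(int(rt60_s * sample_rate), 1)
    t = np.arange(n) / sample_rate
    impulse = rng.normal(size=n) * np.exp(-6.9 * t / rt60_s)
    impulse[0] = 1.0  # keep the direct path
    wet = np.convolve(wave, impulse)[: len(wave)]
    # Rescale so the peak level matches the dry signal.
    return wet * (np.max(np.abs(wave)) / max(np.max(np.abs(wet)), 1e-9))

def clip(wave, threshold):
    """Hard-clip samples to [-threshold, threshold] to mimic overdriven input."""
    return np.clip(wave, -threshold, threshold)

def speed_perturb(wave, rate):
    """Resample by linear interpolation; rate > 1 shortens and raises pitch."""
    n_out = int(round(len(wave) / rate))
    positions = np.linspace(0, len(wave) - 1, n_out)
    return np.interp(positions, np.arange(len(wave)), wave)

def augment(wave, sample_rate, rng):
    """Apply each transform with probability 0.5 (an arbitrary choice)."""
    out = wave.astype(np.float64)
    if rng.random() < 0.5:
        out = speed_perturb(out, rng.uniform(0.9, 1.1))
    if rng.random() < 0.5:
        out = add_reverb(out, sample_rate, rng)
    if rng.random() < 0.5:
        out = add_echo(out, sample_rate)
    if rng.random() < 0.5:
        out = add_noise(out, rng.uniform(5.0, 20.0), rng)
    if rng.random() < 0.5:
        out = clip(out, 0.8 * np.max(np.abs(out)))
    return out.astype(np.float32)
```

In a training loop, a function like `augment` would typically be called on each waveform at load time with a seeded `np.random.default_rng`, so every epoch sees a different distorted version of the same utterance while the transcript stays unchanged.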

Key facts

  • Fine-tuned tugstugi/bengaliai-regional-asr-whisper-medium for Bangla ASR
  • Used a custom dataset of ~15,000 segmented and aligned Bangla audio files
  • Data augmentation included noise injection, reverb, echo, clipping, pitch/time perturbation
  • Fine-tuned pyannote/segmentation-3.0 for speaker diarization
  • Used PyTorch Lightning to fine-tune the segmentation model on a competition-annotated dataset
  • Integrated fine-tuned segmentation into pyannote/speaker-diarization-community-1 pipeline
  • Addresses long-form recordings, diverse acoustic conditions, speaker variability
  • Published on arXiv (2605.08214)

Entities

Institutions

  • arXiv
