Hindi Speech Recognition Achieves 91.79% Accuracy with CNN-Based Keyword Spotting

ai-technology · 2026-05-07

A study of keyword spotting (KWS) for Hindi speech reports 91.79% accuracy on a dataset of 40,000 audio samples. The system extracts Mel Frequency Cepstral Coefficients (MFCCs) from the audio and feeds them to Convolutional Neural Networks (CNNs). The samples were recorded at 44 kHz and average 1.9 seconds in length. The approach targets on-device, user-specific query recognition while keeping computation low. Several CNN architectures were tested, and the best-performing model reached the reported accuracy. The work was published on arXiv under the computer science and sound categories.
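
To give a sense of scale for the recording figures above, the arithmetic can be sketched in Python. The sample rate and average duration come from the summary. The 25 ms analysis window and 10 ms hop are common speech-processing defaults assumed here for illustration only, since the summary does not say how the study framed its audio.

```python
# Rough per-clip sizes for one average-length recording.
SAMPLE_RATE_HZ = 44_000   # stated in the summary
AVG_DURATION_S = 1.9      # stated in the summary
WINDOW_S = 0.025          # assumption: 25 ms analysis window
HOP_S = 0.010             # assumption: 10 ms hop between windows


def samples_per_clip(rate_hz, duration_s):
    """Number of raw samples in a clip of the given duration."""
    return round(rate_hz * duration_s)


def frame_count(n_samples, window, hop):
    """Number of full windows that fit in the clip without padding."""
    if n_samples < window:
        return 0
    return 1 + (n_samples - window) // hop


n_samples = samples_per_clip(SAMPLE_RATE_HZ, AVG_DURATION_S)
window = round(SAMPLE_RATE_HZ * WINDOW_S)
hop = round(SAMPLE_RATE_HZ * HOP_S)
n_frames = frame_count(n_samples, window, hop)

print(n_samples, window, hop, n_frames)  # prints: 83600 1100 440 188
```

Under these assumed settings, an average clip holds 83,600 raw samples and yields 188 frames, which is the kind of reduction that makes a small on-device model practical.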

Key facts

Dataset of 40,000 Hindi audio samples used
Sampling rate of 44 kHz
Average audio duration of 1.9 seconds
MFCC features extracted from raw audio
CNN-based classification achieves 91.79% accuracy
Focus on on-device, user-specific keyword spotting
Multiple CNN architectures evaluated
Published on arXiv (ID: 2605.02928)
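
The MFCC step listed above can be sketched in plain Python. This is a minimal illustration of the general technique, not the study's implementation: the frame length, hop, filter count, number of coefficients, and all function names are assumptions, and a naive DFT stands in for the FFT a real system would use. The demo runs on a synthetic tone at 8 kHz to keep the naive DFT fast, not on the 44 kHz recordings described in the summary.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Power spectrum (bins 0..N/2) of one Hamming-windowed frame, naive DFT."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def mel_filterbank(n_filters, n_fft, rate_hz):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    top_mel = hz_to_mel(rate_hz / 2)
    edges_hz = [mel_to_hz(top_mel * i / (n_filters + 1))
                for i in range(n_filters + 2)]
    edges = [hz * n_fft / rate_hz for hz in edges_hz]  # fractional bin positions
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for b in range(n_fft // 2 + 1):
            if lo < b <= mid:
                row.append((b - lo) / (mid - lo))   # rising slope
            elif mid < b < hi:
                row.append((hi - b) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def dct_ii(values, n_out):
    """First n_out coefficients of the (unnormalized) DCT-II."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(n_out)]


def mfcc(signal, rate_hz, frame_len=256, hop=128, n_filters=20, n_coeffs=13):
    """Return one list of n_coeffs MFCCs per full frame of the signal."""
    bank = mel_filterbank(n_filters, frame_len, rate_hz)
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        spec = power_spectrum(signal[start:start + frame_len])
        energies = [sum(w * p for w, p in zip(row, spec)) for row in bank]
        log_energies = [math.log(max(e, 1e-10)) for e in energies]  # floor avoids log(0)
        features.append(dct_ii(log_energies, n_coeffs))
    return features


# Demo: a synthetic 440 Hz tone, 1024 samples at an assumed 8 kHz rate.
RATE_HZ = 8000
tone = [math.sin(2 * math.pi * 440 * t / RATE_HZ) for t in range(1024)]
feats = mfcc(tone, RATE_HZ)
print(len(feats), len(feats[0]))  # prints: 7 13
```

The result is a small frames-by-coefficients grid per clip. A CNN classifier of the kind the summary describes would take such a grid as its input, much as it would a single-channel image.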
