LLM Decoders Don't Amplify Racial Bias in Speech Recognition, Study Finds
A recent study posted to arXiv (2604.21276) investigates whether large language model (LLM) decoders in speech recognition introduce or exacerbate demographic bias. The researchers evaluated nine models spanning three architectural types: CTC (no language model), encoder-decoder (implicit LM), and LLM-based (explicit pretrained decoder). They analyzed approximately 43,000 utterances from Common Voice 24 and Meta's Fair-Speech dataset, which mitigates vocabulary confounds, across five demographic axes: ethnicity, accent, gender, age, and first language.

Notable results: LLM decoders did not amplify racial bias (Granite-8B showed the best ethnicity fairness, with a max/min WER ratio of 2.28); Whisper exhibited severe hallucination on Indian-accented speech, with the insertion rate spiking non-monotonically to 9.62% at large-v3; and audio compression predicted accent fairness. The findings challenge common assumptions about LLM-induced bias in speech recognition.
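The study's headline numbers are word error rate (WER) and insertion rate. As an illustrative sketch (not the paper's evaluation code), both can be derived from a word-level Levenshtein alignment between reference and hypothesis transcripts; insertions are the edit operation that captures hallucinated words like those observed with Whisper:

```python
def align_counts(ref, hyp):
    """Levenshtein alignment over word lists.

    Returns (substitutions, insertions, deletions) for the
    minimum-cost alignment of ref against hyp.
    """
    m, n = len(ref), len(hyp)
    # dp[i][j] = (subs, ins, dels) for aligning ref[:i] with hyp[:j]
    dp = [[(0, 0, 0)] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = (0, 0, i)          # delete all remaining ref words
    for j in range(1, n + 1):
        dp[0][j] = (0, j, 0)          # insert all remaining hyp words
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s, ins, d = dp[i - 1][j - 1]
            # diagonal move: match (free) or substitution (+1 sub)
            diag = (s, ins, d) if ref[i - 1] == hyp[j - 1] else (s + 1, ins, d)
            left = dp[i][j - 1]
            up = dp[i - 1][j]
            dp[i][j] = min(
                diag,
                (left[0], left[1] + 1, left[2]),   # insertion
                (up[0], up[1], up[2] + 1),         # deletion
                key=sum,                            # minimize total edits
            )
    return dp[m][n]


def wer(ref, hyp):
    """Word error rate: (S + I + D) / reference length."""
    s, i, d = align_counts(ref.split(), hyp.split())
    return (s + i + d) / max(len(ref.split()), 1)


def insertion_rate(ref, hyp):
    """Fraction of hypothesis words inserted relative to reference length."""
    _, i, _ = align_counts(ref.split(), hyp.split())
    return i / max(len(ref.split()), 1)
```

For example, `wer("a b c", "a x c d")` is 2/3 (one substitution plus one insertion over three reference words), while `insertion_rate` isolates only the inserted word, giving 1/3.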
Key facts
- Study evaluates nine models across CTC, encoder-decoder, and LLM-based architectures
- Uses about 43,000 utterances from Common Voice 24 and Meta's Fair-Speech dataset
- Examines five demographic axes: ethnicity, accent, gender, age, first language
- Granite-8B has the best ethnicity fairness, with a max/min WER ratio of 2.28
- Whisper exhibits pathological hallucination on Indian-accented speech
- Whisper large-v3 shows non-monotonic insertion-rate spike to 9.62%
- Audio compression predicts accent fairness
- LLM decoders do not amplify racial bias on clean audio
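The max/min WER fairness metric cited above can be sketched in a few lines. The per-group error rates below are invented for illustration; only the 2.28 ratio reported for Granite-8B comes from the study:

```python
def max_min_wer_ratio(group_wers):
    """Fairness gap as the ratio of the worst to the best group-level WER.

    A ratio of 1.0 means all demographic groups see identical error
    rates; larger values mean a wider gap between groups.
    """
    vals = list(group_wers.values())
    return max(vals) / min(vals)


# Hypothetical per-group WERs chosen so the ratio lands at the
# study's reported 2.28 for Granite-8B on the ethnicity axis.
group_wer = {"group_a": 0.082, "group_b": 0.105, "group_c": 0.187}
print(round(max_min_wer_ratio(group_wer), 2))  # prints 2.28
```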
Entities
Institutions
- arXiv
- Meta