Study Finds Audio-Language Models Fail to Use Clinical Context for Dysarthric Speech
A new study published on arXiv (2605.02782) finds that current audio-language models fail to effectively leverage multimodal clinical context to improve automatic speech recognition (ASR) for dysarthric speech. The researchers introduce a benchmark built on the Speech Accessibility Project (SAP) dataset, testing whether diagnosis labels, clinician-derived speech ratings, and detailed clinical descriptions improve transcription accuracy. Across nine models, diagnosis-informed and clinically detailed prompts yielded negligible gains and often worsened word error rate (WER). The study also explores context-dependent fine-tuning, applying LoRA adaptation with a mixture of clinical prompt formats, which did reduce WER. The findings highlight the brittleness of ASR systems on atypical speech and the need for better integration of clinical context.
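To make the benchmark design concrete, here is a minimal sketch of how clinical context might be injected into a transcription prompt at the three context levels the study tests. The prompt wording, function name, and field names are hypothetical illustrations, not the paper's actual templates.

```python
# Hypothetical sketch: build a transcription prompt at one of several
# clinical-context levels (zero context, diagnosis, rating, description).
def build_clinical_prompt(
    level: str,
    diagnosis: str | None = None,
    rating: str | None = None,
    description: str | None = None,
) -> str:
    """Return a transcription instruction with the requested context level."""
    base = "Transcribe the following speech sample verbatim."
    if level == "diagnosis" and diagnosis:
        return f"The speaker has been diagnosed with {diagnosis}. {base}"
    if level == "rating" and rating:
        return f"A clinician rated the speaker's intelligibility as {rating}. {base}"
    if level == "description" and description:
        return f"Clinical notes: {description} {base}"
    return base  # zero-context baseline


# Example usage with hypothetical values:
print(build_clinical_prompt("diagnosis", diagnosis="Parkinson's disease"))
print(build_clinical_prompt("rating", rating="moderately reduced"))
```

The resulting string would be passed alongside the audio to each audio-language model, and transcripts compared against references by WER.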
Key facts
- Study tests nine audio-language models on dysarthric speech recognition
- Uses Speech Accessibility Project (SAP) dataset
- Clinical context includes diagnosis labels, speech ratings, and descriptions
- Current models do not meaningfully use clinical context
- Diagnosis-informed prompts yield negligible improvements
- Clinically detailed prompts often worsen WER
- LoRA adaptation with mixed clinical prompts reduces WER (see the sketch after this list)
- Published on arXiv with ID 2605.02782
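The following sketch shows what context-dependent LoRA fine-tuning with a mixture of clinical prompt formats could look like using Hugging Face peft. The base checkpoint, target modules, and hyperparameters are assumptions for illustration; the study's exact recipe is not reproduced here.

```python
# Illustrative LoRA setup with Hugging Face peft; values below are assumed,
# not taken from the paper.
import random

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSpeechSeq2Seq

# Wrap a speech seq2seq model with low-rank adapters on the attention projections.
base = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-small")
config = LoraConfig(
    r=16,                                 # adapter rank (assumed)
    lora_alpha=32,                        # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # Whisper attention projections
)
model = get_peft_model(base, config)
model.print_trainable_parameters()

# "Mixture of clinical prompt formats": during training, each utterance
# would be paired with a randomly sampled context format.
PROMPT_FORMATS = ["no_context", "diagnosis", "clinician_rating", "clinical_description"]

def sample_format() -> str:
    return random.choice(PROMPT_FORMATS)
```

Training on a mixture of formats, rather than a single fixed prompt, is what lets the adapted model benefit from clinical context at inference time regardless of which context level is supplied.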
Entities
Institutions
- arXiv
- Speech Accessibility Project