Study Finds Audio-Language Models Fail to Use Clinical Context for Dysarthric Speech
A new study published on arXiv (2605.02782) finds that current audio-language models fail to effectively leverage multimodal clinical context to improve automatic speech recognition (ASR) for dysarthric speech. The researchers introduce a benchmark built on the Speech Accessibility Project (SAP) dataset, testing whether diagnosis labels, clinician-derived speech ratings, and detailed clinical descriptions improve transcription accuracy. Across nine models, diagnosis-informed and clinically detailed prompts yielded negligible gains and often worsened word error rate (WER). The study also explores context-dependent fine-tuning, applying LoRA adaptation with a mixture of clinical prompt formats, which did reduce WER. The findings highlight the brittleness of ASR systems on atypical speech and the need for better integration of clinical context.
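To make the benchmark design concrete, here is a minimal sketch of how clinical context might be injected into a transcription prompt at the three context levels the study tests. The prompt wording, function name, and field names are hypothetical illustrations, not the paper's actual templates.

```python
# Hypothetical sketch: build a transcription prompt at one of several
# clinical-context levels (zero context, diagnosis, rating, description).
def build_clinical_prompt(
    level: str,
    diagnosis: str | None = None,
    rating: str | None = None,
    description: str | None = None,
) -> str:
    """Return a transcription instruction with the requested context level."""
    base = "Transcribe the following speech sample verbatim."
    if level == "diagnosis" and diagnosis:
        return f"The speaker has been diagnosed with {diagnosis}. {base}"
    if level == "rating" and rating:
        return f"A clinician rated the speaker's intelligibility as {rating}. {base}"
    if level == "description" and description:
        return f"Clinical notes: {description} {base}"
    return base  # zero-context baseline


# Example usage with hypothetical values:
print(build_clinical_prompt("diagnosis", diagnosis="Parkinson's disease"))
print(build_clinical_prompt("rating", rating="moderately reduced"))
```

The resulting string would be passed alongside the audio to each audio-language model, and transcripts compared against references by WER.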
Key facts
- Study tests nine audio-language models on dysarthric speech recognition
- Uses Speech Accessibility Project (SAP) dataset
- Clinical context includes diagnosis labels, speech ratings, and descriptions
- Current models do not meaningfully use clinical context
- Diagnosis-informed prompts yield negligible improvements
- Clinically detailed prompts often worsen WER
- LoRA adaptation with mixed clinical prompts reduces WER (see the sketch after this list)
- Published on arXiv with ID 2605.02782
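The following sketch shows what context-dependent LoRA fine-tuning with a mixture of clinical prompt formats could look like using Hugging Face peft. The base checkpoint, target modules, and hyperparameters are assumptions for illustration; the study's exact recipe is not reproduced here.

```python
# Illustrative LoRA setup with Hugging Face peft; values below are assumed,
# not taken from the paper.
import random

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSpeechSeq2Seq

# Wrap a speech seq2seq model with low-rank adapters on the attention projections.
base = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-small")
config = LoraConfig(
    r=16,                                 # adapter rank (assumed)
    lora_alpha=32,                        # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # Whisper attention projections
)
model = get_peft_model(base, config)
model.print_trainable_parameters()

# "Mixture of clinical prompt formats": during training, each utterance
# would be paired with a randomly sampled context format.
PROMPT_FORMATS = ["no_context", "diagnosis", "clinician_rating", "clinical_description"]

def sample_format() -> str:
    return random.choice(PROMPT_FORMATS)
```

Training on a mixture of formats, rather than a single fixed prompt, is what lets the adapted model benefit from clinical context at inference time regardless of which context level is supplied.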
Entities
Institutions
- arXiv
- Speech Accessibility Project