ARTFEED — Contemporary Art Intelligence

Study Finds Audio-Language Models Fail to Use Clinical Context for Dysarthric Speech

ai-technology · 2026-05-06

A new study published on arXiv (2605.02782) finds that current audio-language models do not effectively leverage multimodal clinical context to improve automatic speech recognition (ASR) for dysarthric speech. The researchers introduced a benchmark built on the Speech Accessibility Project (SAP) dataset, testing whether diagnosis labels, clinician-derived speech ratings, and detailed clinical descriptions enhance transcription accuracy. Across nine models, diagnosis-informed and clinically detailed prompts yielded negligible improvements and often worsened transcription, increasing word error rate (WER). The study also explored context-dependent fine-tuning with LoRA adaptation using a mixture of clinical prompt formats, which did reduce word error rate. The findings highlight the brittleness of ASR systems for atypical speech and the need for better integration of clinical context.
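The paper's central metric is word error rate, the standard ASR measure: the number of word-level substitutions, deletions, and insertions needed to turn the hypothesis transcript into the reference, divided by the reference length. As a minimal illustration (not the study's evaluation pipeline, which is not described here), WER can be computed with a word-level edit distance:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)


# One substitution ("quik") and one deletion ("fox") against a
# four-word reference gives a WER of 0.5.
print(wer("the quick brown fox", "the quik brown"))  # → 0.5
```

A WER of 0 means a perfect transcript; values above 1 are possible when the hypothesis contains many insertions. "Degraded WER" in the study's results means this number went up.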

Key facts

  • Study tests nine audio-language models on dysarthric speech recognition
  • Uses Speech Accessibility Project (SAP) dataset
  • Clinical context includes diagnosis labels, speech ratings, and descriptions
  • Current models do not meaningfully use clinical context
  • Diagnosis-informed prompts yield negligible improvements
  • Clinically detailed prompts often increase word error rate
  • LoRA adaptation with mixed clinical prompts reduces WER
  • Published on arXiv with ID 2605.02782

Entities

Institutions

  • arXiv
  • Speech Accessibility Project

Sources