New Benchmark Evaluates Commercial ASR on Code-Switching Speech
A new research paper introduces a benchmark for evaluating commercial automatic speech recognition (ASR) systems on code-switching speech, in which speakers alternate between languages within a single utterance. It covers four language pairs: Egyptian Arabic–English, Saudi Arabic (Najdi/Hijazi)–English, Persian (Farsi)–English, and German–English. Each dataset contains 300 samples, curated in two stages: a heuristic filter first scores transcripts on five structural code-switching indicators, and an ensemble of GPT-4o and Gemini 1.5 Pro then rates the remaining candidates on six linguistic dimensions. Filtering before LLM scoring cuts LLM scoring costs by about 91% compared with exhaustively scoring every candidate. The paper argues that code-switching is often overlooked, and it criticizes existing benchmarks for evaluating only clean, monolingual audio and reporting a single Word Error Rate (WER) metric.
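For readers unfamiliar with the metric, WER is conventionally the number of word substitutions, deletions, and insertions needed to turn the ASR output into the reference, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the paper's evaluation code. It assumes plain whitespace tokenization and no text normalization, which is a simplification for any real ASR evaluation and especially for mixed-script transcripts.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length.

    Assumption: tokens are split on whitespace with no normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Because the result is one number per utterance or corpus, it does not show whether errors cluster around language switch points, which is the kind of information a single WER figure hides.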
Key facts
- Benchmark evaluates five commercial ASR providers.
- Covers four language pairs: Egyptian Arabic–English, Saudi Arabic–English, Persian–English, German–English.
- Each dataset has 300 samples.
- Two-stage pipeline: heuristic filter then LLM ensemble (GPT-4o and Gemini 1.5 Pro).
- Pipeline reduces LLM scoring costs by ~91%.
- Code-switching is alternation between two languages in one utterance.
- Existing benchmarks use clean, monolingual audio and a single WER metric.
- Published on arXiv with ID 2605.19069.
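The two-stage curation idea can be sketched as follows. This is an illustrative outline under stated assumptions, not the paper's implementation: the summary does not name the five structural indicators or the six linguistic dimensions, so `switch_count` is a single hypothetical indicator (script changes between adjacent tokens), and the scorers are placeholder callables standing in for the LLM calls. A script-based indicator like this one would not work for German–English, where both languages use the Latin alphabet.

```python
from typing import Callable, List, Sequence, Tuple


def is_latin(token: str) -> bool:
    """True if the token contains at least one ASCII letter."""
    return any(ch.isascii() and ch.isalpha() for ch in token)


def switch_count(transcript: str) -> int:
    """Hypothetical indicator: number of script changes between adjacent tokens."""
    flags = [is_latin(tok) for tok in transcript.split()]
    return sum(a != b for a, b in zip(flags, flags[1:]))


def heuristic_filter(transcripts: Sequence[str], min_switches: int = 2) -> List[str]:
    """Stage 1: cheap filter that keeps only transcripts likely to be code-switched."""
    return [t for t in transcripts if switch_count(t) >= min_switches]


def ensemble_score(transcript: str, scorers: Sequence[Callable[[str], float]]) -> float:
    """Stage 2: average the scores of several scorers (stand-ins for LLM calls)."""
    scores = [score(transcript) for score in scorers]
    return sum(scores) / len(scores)


def select_samples(
    transcripts: Sequence[str],
    scorers: Sequence[Callable[[str], float]],
    k: int = 300,
    min_switches: int = 2,
) -> Tuple[List[str], int]:
    """Run both stages; return the top-k transcripts and how many reached stage 2."""
    candidates = heuristic_filter(transcripts, min_switches)
    ranked = sorted(candidates, key=lambda t: ensemble_score(t, scorers), reverse=True)
    return ranked[:k], len(candidates)


def llm_call_savings(total: int, scored: int) -> float:
    """Fraction of LLM scoring calls avoided by filtering first."""
    return 1 - scored / total
```

Under the simplifying assumption that cost scales with the number of transcripts sent to the LLMs, a saving of about 91% corresponds to the filter passing roughly 9% of transcripts to the second stage; for example, `llm_call_savings(1000, 90)` evaluates to approximately 0.91. The counts in that example are made up for illustration and are not taken from the paper.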
Entities
Institutions
- arXiv