ARTFEED — Contemporary Art Intelligence

Text-Only Data Integration for Encoder-Dominated Speech Recognition

other · 2026-04-30

This study, from the arXiv Computer Science > Computation and Language category, examines strategies for improving speech recognition with text-only data, focusing on encoder-dominated models that enable faster recognition. The authors systematically evaluate methods for injecting text-only data, such as modality matching and dynamic downsampling, to build text-level representations inside the encoder. Experiments on the LibriSpeech corpus show that a larger encoder paired with a smaller decoder can match or exceed the performance of models with larger decoders. Moreover, simpler setups, such as random-duration models, often outperform more intricate ones, considerably simplifying training. All code and recipes are publicly available.
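To make the "random duration" idea concrete: rather than learning an alignment or duration model, each text token can simply be repeated a random number of times so that text inputs resemble encoder-rate speech features in length. The sketch below is an illustrative assumption about how such upsampling might look, not the paper's actual implementation; the function name and repetition range are hypothetical.

```python
import random

def upsample_tokens(token_ids, min_rep=1, max_rep=4, seed=0):
    """Randomly repeat each text token to mimic speech frame durations.

    A minimal sketch (not the paper's code): each token is duplicated
    a random number of times so a text sequence approximates the frame
    rate a speech encoder would see, letting text-only data flow
    through the same encoder pathway as audio-derived features.
    """
    rng = random.Random(seed)
    frames = []
    for tok in token_ids:
        frames.extend([tok] * rng.randint(min_rep, max_rep))
    return frames

# Example: a short token sequence stretched to speech-like length.
frames = upsample_tokens([7, 7, 3, 9])
```

The appeal of such a scheme, as the summary notes, is that it removes the need for a trained duration predictor while still letting the encoder consume text at a plausible frame rate.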

Key facts

  • The paper is from arXiv Computer Science > Computation and Language.
  • It focuses on encoder-dominated speech recognition models.
  • Techniques include modality matching and dynamic downsampling.
  • Experiments use the LibriSpeech corpus.
  • A larger encoder with a smaller decoder can equal or surpass larger decoder architectures.
  • Simple random duration models are often more effective than complex alternatives.
  • The training pipeline is simplified.
  • All code and recipes are publicly available.

Entities

Institutions

  • arXiv

Datasets

  • LibriSpeech

Sources