Streaming Speech-to-Text Translation with a SpeechLLM
A new LLM-based architecture enables true streaming speech-to-text translation, addressing the latency limitations of existing SpeechLLM systems. Such systems either wait for a complete utterance or emit output at fixed intervals, neither of which suits real-time applications. The proposed model instead learns to decide when it has received enough audio to emit output tokens, and is trained on automatic alignments between the input speech and the output text. Experiments on multiple language pairs show translation quality close to that of non-streaming baselines at significantly lower latency. The work is published on arXiv under identifier 2605.14766.
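The core idea can be illustrated with a minimal read/write loop. This is a hedged sketch, not the paper's implementation: the `ToyStreamingModel`, its `step` method, and the `WAIT` symbol are all hypothetical stand-ins for a model that decides, after each incoming audio chunk, whether it has heard enough to commit to more output tokens.

```python
# Minimal sketch of a streaming read/write policy (hypothetical interface).
# `step` sees all audio received so far plus the tokens already emitted,
# and either returns a new token ("write") or WAIT ("read more audio").

WAIT = "<wait>"

class ToyStreamingModel:
    """Stand-in for the SpeechLLM: emits one word per ~2 audio chunks."""
    def step(self, audio_chunks, emitted):
        words = ["hallo", "welt"]          # pretend translation output
        ready = len(audio_chunks) // 2     # words "covered" by audio so far
        if len(emitted) < ready:
            return words[len(emitted)]     # enough audio heard: write
        return WAIT                        # not enough audio yet: read

def streaming_translate(model, audio_stream):
    chunks, output = [], []
    for chunk in audio_stream:             # audio arrives incrementally
        chunks.append(chunk)
        while True:                        # flush all tokens now justified
            token = model.step(chunks, output)
            if token == WAIT:
                break
            output.append(token)
    return output

result = streaming_translate(ToyStreamingModel(), ["c1", "c2", "c3", "c4"])
```

Because the model itself chooses when to write, latency adapts to the input, unlike fixed-interval emission, which outputs on a rigid schedule regardless of whether the audio heard so far justifies a translation.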
Key facts
- Proposed LLM-based architecture for true streaming speech-to-text translation.
- System learns to decide when it has enough audio to output tokens.
- Trained using automatic alignments of input speech and output text.
- Experiments on multiple language pairs show quality close to the non-streaming baseline.
- Addresses latency issues in existing SpeechLLM systems.
- Published on arXiv with identifier 2605.14766.
- Existing systems wait for complete utterance or output at fixed intervals.
- Combines speech recognition and text-to-text translation into a single model.
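The training signal from automatic alignments can be sketched as well. The assumption here (not detailed in the source) is that a forced aligner supplies, for each target token, the end time of the source speech it aligns to; interleaving a wait symbol according to a fixed chunk size then teaches the model to hold its output until the corresponding audio has actually arrived. All names below are illustrative.

```python
# Hedged sketch: building streaming training targets from word-level
# alignments. Each target token carries the end time (seconds) of the
# speech it aligns to; a <wait> symbol is inserted for every audio chunk
# that must be consumed before the token may be emitted.

WAIT = "<wait>"

def interleave_targets(aligned_tokens, chunk_sec=1.0):
    """aligned_tokens: list of (token, end_time_sec) from a forced aligner."""
    target, heard = [], 0.0
    for token, end_time in aligned_tokens:
        while heard < end_time:            # not enough audio seen yet
            target.append(WAIT)            # supervise the model to wait
            heard += chunk_sec
        target.append(token)               # audio covers the token: emit it
    return target

seq = interleave_targets([("hello", 0.8), ("world", 2.3)], chunk_sec=1.0)
```

Training on such interleaved sequences lets a single model fold recognition-style timing decisions and text translation into one decoding process, matching the single-model design noted above.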
Entities
Institutions
- arXiv