DOA: Training-Free Decoder-Only Attention Policy for Long-Form Simultaneous Translation with SpeechLLMs
Decoder-Only Attention (DOA) is a new training-free policy that enables long-form simultaneous speech-to-text translation with off-the-shelf Speech Large Language Models (SpeechLLMs). Current attention-based simultaneous translation systems rely on encoder-decoder models, where cross-attention provides the alignment between source and target; SpeechLLMs are decoder-only and have only self-attention. DOA derives a proxy alignment from self-attention, which allows streaming decisions without additional training. The approach addresses the lack of validation in long-form settings and avoids reliance on heuristic wait-k policies. The paper is available on arXiv under reference 2605.31432.
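To make the idea of a self-attention proxy alignment concrete, the following is a minimal sketch of how an attention-based emit-or-wait decision could look. It is an illustration only and not the decision rule from the paper: the function name `should_emit`, the parameters `last_frames` and `threshold`, and the layout of the context (prompt, then audio frames, then generated text) are all assumptions made for this example.

```python
import numpy as np

def should_emit(self_attn, audio_slice, last_frames=2, threshold=0.3):
    """Hypothetical emit-or-wait rule based on self-attention.

    self_attn:   1-D attention weights of the candidate token over every
                 position in the decoder context (assumed layout:
                 prompt, audio frames, previously generated text).
    audio_slice: slice selecting the audio-frame positions.
    Returns True if the token may be emitted now, False to wait for audio.
    """
    # Keep only the attention paid to audio frames (the proxy alignment).
    audio_attn = np.asarray(self_attn, dtype=float)[audio_slice]
    total = audio_attn.sum()
    if total <= 0.0:
        # No usable alignment signal yet: be conservative and wait.
        return False
    # Share of audio attention that falls on the most recent frames.
    recent_mass = audio_attn[-last_frames:].sum() / total
    # Heavy focus on the newest audio suggests the token may depend on
    # speech that has not fully arrived, so the policy waits.
    return recent_mass < threshold

# Context: 1 prompt position, 4 audio frames, 1 generated-text position.
audio = slice(1, 5)
early_focus = [0.1, 0.4, 0.3, 0.05, 0.05, 0.1]  # attends to older audio
late_focus = [0.1, 0.05, 0.05, 0.3, 0.4, 0.1]   # attends to newest audio
print(should_emit(early_focus, audio))
print(should_emit(late_focus, audio))
```

In this sketch a token is emitted when its attention rests mostly on audio received some time ago, and held back when it concentrates on the newest frames. How attention is aggregated across heads and layers, and how thresholds are chosen, are left open here because the summary does not specify them.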
Key facts
- DOA is a training-free policy for simultaneous translation.
- It uses decoder-only SpeechLLMs without cross-attention.
- The policy derives alignment from self-attention.
- It enables long-form simultaneous translation.
- Current methods rely on encoder-decoder models with cross-attention or on heuristic wait-k policies.
- The approach works with off-the-shelf SpeechLLMs.
- The paper is on arXiv (2605.31432).
- It addresses the lack of validation in long-form settings.
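For contrast, the heuristic wait-k policy mentioned above can be sketched in a few lines. Wait-k reads k source units before writing the first target token and then alternates between reading and writing, regardless of what the model's attention indicates. The function name `wait_k_actions` is hypothetical, and treating a source unit as a fixed-size audio chunk is an assumption for speech input.

```python
def wait_k_actions(k, num_source_chunks, num_target_tokens):
    """Return the READ/WRITE action sequence of a wait-k policy."""
    actions = []
    read, written = 0, 0
    while written < num_target_tokens:
        # Stay k chunks ahead of the output until the source is exhausted.
        if read < min(written + k, num_source_chunks):
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions

print(wait_k_actions(k=2, num_source_chunks=4, num_target_tokens=4))
# ['READ', 'READ', 'WRITE', 'READ', 'WRITE', 'READ', 'WRITE', 'WRITE']
```

The schedule is fixed in advance by k, which is what makes it a heuristic: it does not adapt to where the content of the speech actually aligns with the output.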
Entities
Institutions
- arXiv