New ASR Framework Unifies Offline and Streaming Speech Recognition with Consistency Regularization
A unified automatic speech recognition framework for Transducer models has been developed to address the challenge of training a single model that performs well in both offline and low-latency streaming settings. The approach combines chunk-limited attention with right context and dynamic chunked convolutions, supporting both decoding modes within one model. To further narrow the gap between offline and streaming performance, the researchers introduced mode-consistency regularization for RNNT (MCR-RNNT), which encourages agreement between the model's outputs across training modes and is implemented efficiently in Triton. Experimental results show that the framework improves streaming accuracy at low latency while preserving offline performance, and that it scales effectively to larger model sizes and training datasets. Both the Unified ASR framework and an English model checkpoint have been released as open-source resources. The research was published on arXiv with the identifier 2604.19079.
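The chunk-limited attention idea can be illustrated with a minimal mask-construction sketch. The function name, the chunk layout, and the choice of unrestricted left context are assumptions for illustration; the paper's exact masking scheme may differ.

```python
def chunk_attention_mask(num_frames, chunk_size, right_context):
    """Build a boolean attention mask for chunk-limited attention.

    Assumption (illustrative): each query frame attends to its full left
    context plus every frame up to the end of its own chunk, extended by
    `right_context` extra frames of lookahead.
    """
    mask = []
    for i in range(num_frames):
        chunk_end = ((i // chunk_size) + 1) * chunk_size  # end of i's chunk (exclusive)
        limit = min(num_frames, chunk_end + right_context)
        mask.append([j < limit for j in range(num_frames)])
    return mask

# With chunk_size=4 and right_context=2, frame 0 may attend through frame 5
# but not frame 6, bounding the lookahead (and hence the streaming latency).
m = chunk_attention_mask(num_frames=8, chunk_size=4, right_context=2)
```

Limiting attention to the current chunk plus a small right context is what bounds latency in streaming mode, while an offline pass can simply use a mask with no such limit.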
Key facts
- Unified ASR framework supports both offline and streaming decoding within a single model
- Uses chunk-limited attention with right context and dynamic chunked convolutions
- Introduces mode-consistency regularization for RNNT (MCR-RNNT)
- Implemented efficiently using Triton
- Improves streaming accuracy at low latency while preserving offline performance
- Scales to larger model sizes and training datasets
- Framework and English model checkpoint are open-sourced
- Research published on arXiv with identifier 2604.19079
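The mode-consistency regularization listed above can be sketched as a symmetric KL term between the offline-mode and streaming-mode output distributions, added to the per-mode RNNT losses. This formulation, the function names, and the 0.5/weighted combination are assumptions for illustration; the paper's exact consistency objective and its Triton kernel are not reproduced here.

```python
import math

def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence between two discrete distributions."""
    kl_pq = sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log((qi + eps) / (pi + eps)) for pi, qi in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)

def mcr_total_loss(rnnt_offline, rnnt_streaming, p_offline, p_streaming, weight=0.1):
    """Hypothetical combined objective: average the RNNT losses of both
    decoding modes and add a weighted consistency term that pulls the
    modes' output distributions toward agreement."""
    consistency = symmetric_kl(p_offline, p_streaming)
    return 0.5 * (rnnt_offline + rnnt_streaming) + weight * consistency
```

When the two modes already agree, the consistency term vanishes and only the RNNT losses remain, so the regularizer penalizes divergence between modes without changing the optimum for either one alone.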
Entities
Institutions
- arXiv