New ASR Framework Unifies Offline and Streaming Speech Recognition with Consistency Regularization
A unified automatic speech recognition framework for Transducer models has been developed to address the challenge of training a single model that performs well in both offline and low-latency streaming settings. The approach combines chunk-limited attention with right context and dynamic chunked convolutions, supporting both decoding modes within one model. To further narrow the gap between offline and streaming performance, the researchers introduced mode-consistency regularization for RNNT (MCR-RNNT), which encourages agreement between the model's outputs across training modes and is implemented efficiently in Triton. Experimental results show that the framework improves streaming accuracy at low latency while preserving offline performance, and that it scales effectively to larger model sizes and training datasets. Both the Unified ASR framework and an English model checkpoint have been released as open-source resources. The research was published on arXiv with the identifier 2604.19079.
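The chunk-limited attention idea can be illustrated with a minimal mask-construction sketch. The function name, the chunk layout, and the choice of unrestricted left context are assumptions for illustration; the paper's exact masking scheme may differ.

```python
def chunk_attention_mask(num_frames, chunk_size, right_context):
    """Build a boolean attention mask for chunk-limited attention.

    Assumption (illustrative): each query frame attends to its full left
    context plus every frame up to the end of its own chunk, extended by
    `right_context` extra frames of lookahead.
    """
    mask = []
    for i in range(num_frames):
        chunk_end = ((i // chunk_size) + 1) * chunk_size  # end of i's chunk (exclusive)
        limit = min(num_frames, chunk_end + right_context)
        mask.append([j < limit for j in range(num_frames)])
    return mask

# With chunk_size=4 and right_context=2, frame 0 may attend through frame 5
# but not frame 6, bounding the lookahead (and hence the streaming latency).
m = chunk_attention_mask(num_frames=8, chunk_size=4, right_context=2)
```

Limiting attention to the current chunk plus a small right context is what bounds latency in streaming mode, while an offline pass can simply use a mask with no such limit.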
Key facts
- Unified ASR framework supports both offline and streaming decoding within a single model
- Uses chunk-limited attention with right context and dynamic chunked convolutions
- Introduces mode-consistency regularization for RNNT (MCR-RNNT)
- Implemented efficiently using Triton
- Improves streaming accuracy at low latency while preserving offline performance
- Scales to larger model sizes and training datasets
- Framework and English model checkpoint are open-sourced
- Research published on arXiv with identifier 2604.19079
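The mode-consistency regularization listed above can be sketched as a symmetric KL term between the offline-mode and streaming-mode output distributions, added to the per-mode RNNT losses. This formulation, the function names, and the 0.5/weighted combination are assumptions for illustration; the paper's exact consistency objective and its Triton kernel are not reproduced here.

```python
import math

def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence between two discrete distributions."""
    kl_pq = sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log((qi + eps) / (pi + eps)) for pi, qi in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)

def mcr_total_loss(rnnt_offline, rnnt_streaming, p_offline, p_streaming, weight=0.1):
    """Hypothetical combined objective: average the RNNT losses of both
    decoding modes and add a weighted consistency term that pulls the
    modes' output distributions toward agreement."""
    consistency = symmetric_kl(p_offline, p_streaming)
    return 0.5 * (rnnt_offline + rnnt_streaming) + weight * consistency
```

When the two modes already agree, the consistency term vanishes and only the RNNT losses remain, so the regularizer penalizes divergence between modes without changing the optimum for either one alone.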
Entities
Institutions
- arXiv