New Autoregressive Model Enables Real-Time Target Speaker Extraction with Chunk-Wise Interleaved Splicing
A new autoregressive model designed specifically for streaming Target Speaker Extraction (TSE) has been introduced to address the limitations of existing generative models. Generative models have achieved strong results on TSE benchmarks, but their dependence on global context prevents real-time use, and adapting them directly to streaming causes significant performance degradation. The proposed approach uses a Chunk-wise Interleaved Splicing Paradigm to enable efficient and stable streaming inference. To keep extracted speech segments coherent, the model adds a historical context refinement mechanism that uses past information to reduce discontinuities at segment boundaries. Experiments on the Libri2Mix dataset show that the model maintains 100% stability and superior intelligibility at low latencies, whereas conventional autoregressive generative baselines degrade. The work, documented in arXiv preprint 2604.19635v1, is presented as the first autoregressive model built for streaming TSE and aims to close the gap between training and real-time inference conditions.
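The summary does not spell out how the interleaved sequence is laid out, so the following is only a minimal sketch of one plausible reading: the mixture is consumed chunk by chunk, each extracted chunk is spliced into the context right after its mixture chunk, and only a bounded window of recent chunk pairs is kept as history. The names `split_into_chunks`, `stream_extract`, `history_chunks`, and `toy_extractor` are hypothetical, and the toy extractor is a placeholder, not the paper's model.

```python
from typing import Callable, List, Sequence

Chunk = List[float]


def split_into_chunks(signal: Sequence[float], chunk_size: int) -> List[Chunk]:
    """Split a signal into consecutive chunks; the last chunk may be shorter."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [list(signal[i:i + chunk_size]) for i in range(0, len(signal), chunk_size)]


def stream_extract(
    mixture: Sequence[float],
    chunk_size: int,
    history_chunks: int,
    extract_chunk: Callable[[List[Chunk], Chunk], Chunk],
) -> List[float]:
    """Process the mixture chunk by chunk with an interleaved context.

    The context is assumed to be laid out as [mix_1, out_1, mix_2, out_2, ...].
    Only the most recent `history_chunks` (mixture, output) pairs are passed
    to the extractor, so per-chunk cost stays bounded.
    """
    context: List[Chunk] = []
    output: List[float] = []
    for mix_chunk in split_into_chunks(mixture, chunk_size):
        # Guard the zero case: context[-0:] would return the whole list.
        history = context[-2 * history_chunks:] if history_chunks > 0 else []
        out_chunk = extract_chunk(history, mix_chunk)
        # Splice the new pair onto the interleaved context.
        context.extend([mix_chunk, out_chunk])
        output.extend(out_chunk)
    return output


def toy_extractor(history: List[Chunk], mix_chunk: Chunk) -> Chunk:
    """Placeholder for a real model: simply halves each sample."""
    return [0.5 * x for x in mix_chunk]


if __name__ == "__main__":
    mixture = [float(i) for i in range(10)]
    print(stream_extract(mixture, chunk_size=4, history_chunks=1,
                         extract_chunk=toy_extractor))
```

In a loop like this, each output chunk can be emitted as soon as its mixture chunk has arrived, so the chunk duration sets a lower bound on algorithmic latency. A smaller `chunk_size` lowers that bound but gives the extractor less new audio per step.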
Key facts
- The work is presented as the first autoregressive model tailored for streaming Target Speaker Extraction (TSE).
- Generative models rely on global context, hindering real-time deployment.
- Adapting these models directly to streaming often causes catastrophic performance degradation at inference time.
- A Chunk-wise Interleaved Splicing Paradigm ensures efficient and stable streaming inference.
- A historical context refinement mechanism mitigates boundary discontinuities between extracted segments by leveraging past information.
- Experiments on Libri2Mix show the approach maintains 100% stability and superior intelligibility at low latencies.
- Autoregressive generative baselines exhibit performance degradation at low latencies.
- The research is presented in arXiv preprint 2604.19635v1, announced as a cross-listing.