DLLM-VSR: First Diffusion LLM for Visual Speech Recognition
Researchers propose DLLM-VSR, the first Diffusion Large Language Model (DLLM) framework for Visual Speech Recognition (VSR). Unlike traditional left-to-right autoregressive decoders, DLLM-VSR decodes by iterative masked denoising in a flexible order: high-confidence tokens are committed early and then serve as bidirectional context for refining ambiguous ones. A two-stage masked-denoising training strategy separates visual-to-text alignment from length modeling. The study also finds that performance still falls short of oracle-length decoding, which suggests that reducing target-length uncertainty can improve DLLM-based VSR. The paper is available on arXiv under ID 2605.28456.
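The confidence-based decoding loop described above can be illustrated with a minimal sketch. This is not the paper's implementation. The names `confidence_unmask`, `toy_scorer`, and `MASK`, the even-share commit schedule, and the toy confidence values are all assumptions made for illustration. A real system would replace `toy_scorer` with a model conditioned on the lip video.

```python
from typing import Callable, List, Tuple

MASK = "<mask>"  # hypothetical mask symbol

# Hypothetical scorer: given the partially filled sequence, return one
# (token, confidence) proposal per position.
Scorer = Callable[[List[str]], List[Tuple[str, float]]]


def confidence_unmask(scorer: Scorer, length: int, steps: int) -> Tuple[List[str], List[int]]:
    """Fill `length` masked slots in at most `steps` rounds, most confident first.

    Returns the decoded tokens and the order in which positions were committed.
    """
    seq = [MASK] * length
    order: List[int] = []
    for step in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break
        proposals = scorer(seq)
        # Assumed schedule: commit an even share of the remaining masks
        # each round (ceiling division), so everything is filled by the end.
        k = -(-len(masked) // (steps - step))
        # Rank the still-masked positions by confidence, highest first.
        ranked = sorted(masked, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:k]:
            seq[i] = proposals[i][0]
            order.append(i)
    return seq, order


def toy_scorer(seq: List[str]) -> List[Tuple[str, float]]:
    """Stand-in for a model: fixed targets, confidence rises with committed neighbours."""
    target = ["the", "cat", "sat", "down"]
    base = [0.9, 0.3, 0.8, 0.2]  # made-up initial confidences
    out = []
    for i, tok in enumerate(target):
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(seq)]
        # Each already-committed neighbour (left or right) adds confidence,
        # mimicking the use of bidirectional context.
        bonus = 0.3 * sum(seq[j] != MASK for j in neighbours)
        out.append((tok, min(1.0, base[i] + bonus)))
    return out


tokens, order = confidence_unmask(toy_scorer, length=4, steps=2)
print(tokens, order)
```

In this toy run, positions 0 and 2 are committed in the first round because they start with the highest confidence, and positions 1 and 3 follow once their neighbours are known. Note that `length` has to be fixed before the loop starts, which is one plausible reason target-length uncertainty matters for this style of decoding.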
Key facts
- DLLM-VSR is the first Diffusion Large Language Model-based VSR framework.
- It uses iterative masked denoising instead of left-to-right autoregressive decoding.
- Confidence-based unmasking commits high-confidence positions early.
- Two-stage training separates visual-to-text alignment from length modeling.
- Comparison with oracle-length decoding reveals a performance gap attributed to target-length uncertainty.
- The paper is on arXiv with ID 2605.28456.
Entities
Institutions
- arXiv