DLLM-VSR: First Diffusion LLM for Visual Speech Recognition

ai-technology · 2026-05-28

Researchers propose DLLM-VSR, the first Diffusion Large Language Model (DLLM) framework for Visual Speech Recognition (VSR). Unlike traditional left-to-right autoregressive decoders, DLLM-VSR decodes by iterative masked denoising in a flexible order: high-confidence tokens are committed early and then serve as bidirectional context for refining ambiguous ones. A two-stage masked-denoising training strategy separates visual-to-text alignment from length modeling. The study also reports a performance gap relative to oracle-length decoding, in which the target length is known in advance, suggesting that reducing target-length uncertainty can improve DLLM-based VSR. The paper is available on arXiv under ID 2605.28456.
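
The confidence-based unmasking loop described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `toy_denoiser` function, the `MASK` placeholder, and the even per-step commit schedule are all hypothetical stand-ins, since the summary does not specify the model interface or the schedule. A real DLLM would condition its predictions on the video features and on the tokens already committed.

```python
import math
import random

MASK = None  # hypothetical placeholder for a still-masked position


def toy_denoiser(tokens, vocab_size, rng):
    """Stand-in for the model (assumption): returns one probability
    distribution over the vocabulary for every position. It ignores
    `tokens`; a real model would use them as bidirectional context."""
    dists = []
    for _ in tokens:
        weights = [rng.random() for _ in range(vocab_size)]
        total = sum(weights)
        dists.append([w / total for w in weights])
    return dists


def confidence_unmask(length, vocab_size, steps, denoiser, seed=0):
    """Start fully masked; at each step commit the most confident positions."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    commit_order = []  # positions in the order they were committed
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        if not masked:
            break
        dists = denoiser(tokens, vocab_size, rng)
        # For each masked position, record (confidence, position, best token).
        candidates = []
        for i in masked:
            best = max(range(vocab_size), key=lambda v: dists[i][v])
            candidates.append((dists[i][best], i, best))
        candidates.sort(reverse=True)  # most confident first
        # Assumed schedule: spread remaining positions evenly over remaining steps.
        k = math.ceil(len(masked) / (steps - step))
        for _, i, best in candidates[:k]:
            tokens[i] = best
            commit_order.append(i)
    return tokens, commit_order


tokens, order = confidence_unmask(length=8, vocab_size=5, steps=4, denoiser=toy_denoiser)
print("tokens:", tokens)
print("commit order:", order)
```

The commit order is generally not left-to-right, which is the point of flexible-order decoding. Note that `length` must be fixed before decoding starts, which is why target-length uncertainty matters for this family of decoders.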

Key facts

DLLM-VSR is the first Diffusion Large Language Model-based VSR framework.
It uses iterative masked denoising instead of left-to-right autoregressive decoding.
Confidence-based unmasking commits high-confidence positions early.
Two-stage training separates content alignment from length modeling.
A performance gap relative to oracle-length decoding points to target-length uncertainty as a limiting factor.
The paper is on arXiv with ID 2605.28456.
