Multimodal LLM Improves Conversational Timing with Video and Audio Cues

ai-technology · 2026-05-22

A team of researchers has introduced MM-When2Speak, a multimodal approach to improving how large language models time their turns in conversation. The system combines synchronized video, audio, and text signals and recasts the question of when to respond as a dense response-type prediction task. As the conversation unfolds, the agent chooses among remaining silent, giving a short reaction, or starting a full response, all under streaming constraints. The researchers curated a multimodal dataset from real-world dyadic conversation videos, with temporally aligned modalities and fine-grained reaction-type annotations. Experiments across different modality settings and against strong LLM baselines indicate that MM-When2Speak improves awareness of conversational timing, a common weakness of chatbots that struggle to respond at the right moment.

Key facts

MM-When2Speak is a multimodal strategy for LLMs
It leverages synchronized video, audio, and text cues
Response timing is reformulated as a dense response-type prediction task
Agent can decide to remain silent, produce a short reaction, or start a full response
Curated multimodal dataset from real-world dyadic conversational videos
Dataset includes temporally aligned modalities and fine-grained reaction type annotations
Experiments conducted across various modality settings and strong LLM baselines
Addresses the difficulty LLMs have deciding when to speak in ongoing dialogue
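
To make the "dense response-type prediction" framing concrete, here is a minimal Python sketch of a streaming loop that emits one of the three decisions per incoming window. It is an illustration under stated assumptions, not the MM-When2Speak model: the names `Reaction`, `dense_decisions`, and `toy_score_fn`, the feature keys `speaker_talking` and `pause_s`, and the 0.5-second threshold are all hypothetical, and the hand-written scorer merely stands in for whatever multimodal model would score each window.

```python
from enum import Enum
from typing import Callable, Iterable, Iterator, List, Sequence


class Reaction(Enum):
    """The three response types named in the article."""
    SILENT = 0
    SHORT_REACTION = 1
    FULL_RESPONSE = 2


def dense_decisions(
    windows: Iterable[dict],
    score_fn: Callable[[Sequence[dict]], Sequence[float]],
) -> Iterator[Reaction]:
    """Yield one Reaction per window, using only the input seen so far.

    Restricting score_fn to past and current windows mimics a streaming
    constraint: no decision may depend on future input.
    """
    history: List[dict] = []
    for window in windows:
        history.append(window)
        scores = score_fn(history)  # one score per Reaction, in enum order
        best = max(range(len(scores)), key=scores.__getitem__)
        yield Reaction(best)


def toy_score_fn(history: Sequence[dict]) -> Sequence[float]:
    """Hypothetical rule-based stand-in for a learned multimodal scorer."""
    latest = history[-1]
    if latest["speaker_talking"]:
        return [1.0, 0.0, 0.0]  # other party is mid-utterance: stay silent
    if latest["pause_s"] < 0.5:  # assumed threshold, for illustration only
        return [0.0, 1.0, 0.0]  # brief pause: short reaction
    return [0.0, 0.0, 1.0]      # longer pause: start a full response


windows = [
    {"speaker_talking": True, "pause_s": 0.0},
    {"speaker_talking": False, "pause_s": 0.3},
    {"speaker_talking": False, "pause_s": 1.2},
]
decisions = list(dense_decisions(windows, toy_score_fn))
print([d.name for d in decisions])
# prints ['SILENT', 'SHORT_REACTION', 'FULL_RESPONSE']
```

The point of the sketch is the shape of the task: a label is produced for every window rather than once per turn, which is what distinguishes dense prediction from a single end-of-turn decision.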
