Multimodal LLM Improves Conversational Timing with Video and Audio Cues
A team of researchers has introduced MM-When2Speak, a multimodal approach designed to improve how large language models time their responses in conversation. Using synchronized video, audio, and text cues, the system recasts the question of when to respond as a dense response-type prediction task. At each point in the conversation, the agent decides whether to remain silent, produce a short reaction, or start a full response, while operating under streaming constraints. The researchers curated a multimodal dataset from real-world dyadic conversational videos, with temporally aligned modalities and fine-grained annotations of reaction types. According to the authors, experiments across different modality settings and against strong LLM baselines show that MM-When2Speak improves awareness of conversational timing, a common weakness of chatbots, which often struggle to know when to speak.
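The dense prediction framing can be illustrated with a minimal sketch. This is not the authors' implementation: the `Action` names, the per-step score tuples, and the argmax decision rule are all assumptions chosen for illustration. A real system would obtain the scores from a model that consumes the video, audio, and text streams.

```python
from enum import Enum
from typing import Iterable, Iterator, Sequence, Tuple


class Action(Enum):
    """The three response types named in the summary (labels are illustrative)."""
    SILENT = 0
    SHORT_REACTION = 1
    FULL_RESPONSE = 2


def decide(scores: Sequence[float]) -> Action:
    """Pick the highest-scoring response type for a single time step."""
    if len(scores) != len(Action):
        raise ValueError("expected one score per response type")
    best = max(range(len(scores)), key=lambda i: scores[i])
    return Action(best)


def stream_decisions(
    score_stream: Iterable[Sequence[float]],
) -> Iterator[Tuple[int, Action]]:
    """Yield (step, action) as each step arrives, never looking ahead.

    Consuming the scores one step at a time mimics a streaming constraint:
    the decision at step t depends only on what has been seen up to t.
    """
    for step, scores in enumerate(score_stream):
        yield step, decide(scores)


# Hypothetical per-step scores (SILENT, SHORT_REACTION, FULL_RESPONSE).
# In practice these would come from a multimodal model, not be hard-coded.
example_scores = [
    (0.90, 0.05, 0.05),  # partner is mid-sentence
    (0.60, 0.30, 0.10),
    (0.20, 0.70, 0.10),  # a nod or "mm-hm" may be appropriate
    (0.10, 0.20, 0.70),  # partner has finished their turn
]

decisions = [action for _, action in stream_decisions(example_scores)]
```

Because a label is produced at every step rather than once per turn, silence is an explicit prediction and not just the absence of output, which is what makes the task "dense".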
Key facts
- MM-When2Speak is a multimodal approach to response timing for LLMs
- It leverages synchronized video, audio, and text cues
- Response timing is reformulated as a dense response-type prediction task
- Agent can decide to remain silent, produce a short reaction, or start a full response
- Curated multimodal dataset from real-world dyadic conversational videos
- Dataset includes temporally aligned modalities and fine-grained reaction type annotations
- Experiments conducted across various modality settings and strong LLM baselines
- Addresses LLMs' difficulty deciding when to speak in an ongoing dialogue
Entities
Institutions
- arXiv