MIST: A New Benchmark for Voice-Controlled IoT Assistants

ai-technology · 2026-05-11

Researchers have released MIST (Multimodal Interactive Speech-based Tool-calling Dataset), a synthetic benchmark built around a multi-turn, voice-driven code generation task for IoT devices. The dataset, described in a paper on arXiv (2605.06897v1), targets real-world challenges: spatiotemporal constraints, speech input, dynamic state tracking, and mixed-initiative interaction. Initial evaluations show a significant performance gap between open-weight and closed-weight multimodal LLMs, and even the strongest closed-weight models leave substantial room for improvement. The team has also released an extensible data generation framework to support further research on mixed-initiative voice assistants for smart homes.

Key facts

MIST is a synthetic multi-turn, voice-driven code generation task for IoT devices.
The dataset addresses spatiotemporal constraints, speech inputs, dynamic state tracking, and mixed-initiative interactions.
A significant gap exists between open- and closed-weight multimodal LLMs on MIST.
Even frontier closed-weight LLMs have substantial headroom for improvement.
An extensible data generation framework is released alongside MIST.
The research is published on arXiv under identifier 2605.06897v1.
The work focuses on voice-based interfaces for smart home IoT devices.
The dataset is designed to facilitate research on mixed-initiative voice assistants.
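
To make "dynamic state tracking" and "mixed-initiative interaction" concrete, here is a minimal sketch of a two-turn exchange. It is not MIST's actual data format or API: the `Home` class, the `handle_turn` function, the `set_power` tool name, and the device names are all hypothetical, and speech recognition is assumed to have already produced the parsed request.

```python
from dataclasses import dataclass, field


@dataclass
class Home:
    """Hypothetical smart-home state: device name -> room and on/off flag."""
    devices: dict = field(default_factory=lambda: {
        "kitchen_light": {"room": "kitchen", "on": False},
        "bedroom_light": {"room": "bedroom", "on": False},
    })


def handle_turn(home, device_type, room=None):
    """Resolve one already-transcribed user turn into a tool call or a question."""
    matches = [
        name for name, info in home.devices.items()
        if name.endswith(device_type) and (room is None or info["room"] == room)
    ]
    if len(matches) > 1:
        # Mixed initiative: the assistant asks for clarification instead of guessing.
        rooms = sorted(home.devices[m]["room"] for m in matches)
        return {"ask": f"Which {device_type}: " + " or ".join(rooms) + "?"}
    if not matches:
        return {"ask": f"I could not find a {device_type}."}
    # Dynamic state: the tool call changes the state that later turns will see.
    home.devices[matches[0]]["on"] = True
    return {"call": "set_power", "device": matches[0], "on": True}


home = Home()
first = handle_turn(home, "light")              # ambiguous: two lights exist
second = handle_turn(home, "light", "kitchen")  # user answers the question
print(first)
print(second)
```

In this sketch the first turn yields a clarifying question because two lights match, and the second turn yields a tool call that updates the stored state. A benchmark of this kind would presumably score a model on producing the right call or question at each turn, though the scoring details are not given in this summary.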
