OmniGUI Benchmark Tests AI Agents on Audio, Video, and Images
Researchers have introduced OmniGUI, a new benchmark for evaluating AI agents that interact with smartphone graphical user interfaces (GUIs) using multiple input modalities: static images, synchronous audio, and video clips. Unlike existing benchmarks, which rely predominantly on static screenshots, OmniGUI captures the transient audio cues and temporal video dynamics common in real-world smartphone use. The dataset includes 709 expert-demonstrated episodes comprising 2,579 action steps across 29 applications, annotated with objective multimodal dependency levels. Because dedicated omni-modal GUI agent frameworks are still in early development, the benchmark targets foundational omni-modal models that can natively process interleaved inputs. The work aims to bridge the gap between current GUI agent evaluations and the multimodal nature of actual smartphone interactions.
Key facts
- OmniGUI is the first step-level benchmark for GUI agents in omni-modal smartphone environments.
- It provides continuous, interleaved multimodal inputs: static images, synchronous audio, and video clips at every action step.
- The dataset contains 709 expert-demonstrated episodes (2,579 action steps) across 29 applications.
- Each episode is annotated with objective multimodal dependency levels.
- Current benchmarks for GUI agents rely predominantly on static screenshots.
- OmniGUI targets foundational omni-modal models that natively process interleaved inputs.
- Dedicated omni-modal GUI agent frameworks are still at a nascent stage.
- The benchmark addresses the need to evaluate agents on transient audio cues and temporal video dynamics.
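To make the dataset shape described above more concrete, here is a minimal Python sketch of how one episode and its steps might be represented. It is an illustration only, not the benchmark's actual schema: the class names, field names, file paths, integer encoding of the dependency level, and both helper functions are assumptions introduced here. Only the three modalities per step and the reported totals (709 episodes, 2,579 steps) come from the summary.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One action step. All field names are hypothetical."""
    screenshot_path: str   # static image for this step
    audio_path: str        # audio recorded alongside the step
    video_path: str        # short video clip for the step
    action: str            # demonstrated action, e.g. "tap"
    dependency_level: int  # assumed integer encoding of the annotation


@dataclass
class Episode:
    """One expert demonstration within a single application."""
    app: str
    instruction: str
    steps: List[Step] = field(default_factory=list)


def mean_steps_per_episode(total_steps: int, total_episodes: int) -> float:
    """Average number of action steps per episode."""
    return total_steps / total_episodes


def steps_at_or_above(episode: Episode, level: int) -> List[Step]:
    """Select steps whose (assumed) dependency level meets a threshold."""
    return [s for s in episode.steps if s.dependency_level >= level]


# Toy episode with made-up paths and levels, purely for illustration.
demo = Episode(
    app="example_app",
    instruction="example task",
    steps=[
        Step("s0.png", "s0.wav", "s0.mp4", "tap", 0),
        Step("s1.png", "s1.wav", "s1.mp4", "swipe", 2),
    ],
)

# Reported totals: 2,579 steps over 709 episodes, about 3.64 steps each.
print(round(mean_steps_per_episode(2579, 709), 2))
print(len(steps_at_or_above(demo, 1)))
```

Grouping steps under an episode while keeping all three modalities on each step mirrors the "interleaved inputs at every action step" description. A real loader would follow whatever file layout the benchmark actually ships with.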
Entities
Institutions
- arXiv