ARTFEED — Contemporary Art Intelligence

Video2GUI: Automated GUI Agent Training from Internet Videos

ai-technology · 2026-05-16

Researchers have introduced Video2GUI, a fully automated pipeline that extracts grounded GUI interaction trajectories from unlabeled Internet videos, addressing the scarcity of large-scale training data for GUI agents. The framework applies a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos and converts them into structured agent trajectories. Applied to 500 million video metadata records, the pipeline produced WildGUI, a dataset of 12 million interaction trajectories spanning more than 1,500 applications and websites. Pre-training Qwen2.5-VL and Mimo-VL on WildGUI yields significant gains in GUI agent performance, demonstrating the approach's effectiveness.
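The article does not detail how the coarse-to-fine filter works, but the general pattern is a cheap metadata pass over all candidates followed by an expensive content pass over the survivors. The sketch below illustrates that pattern only; the field names, keyword list, thresholds, and `frame_scorer` hook are illustrative assumptions, not the paper's actual criteria.

```python
from dataclasses import dataclass

# Hypothetical metadata record; the real pipeline's schema is not described
# in the article.
@dataclass
class VideoMeta:
    title: str
    duration_s: int
    resolution: tuple  # (width, height)

# Assumed keyword heuristic for spotting GUI tutorials in titles.
GUI_KEYWORDS = ("tutorial", "how to", "walkthrough", "step by step")

def coarse_filter(videos):
    """Cheap pass over metadata: keep videos that look like GUI tutorials."""
    kept = []
    for v in videos:
        title = v.title.lower()
        if (any(k in title for k in GUI_KEYWORDS)
                and 60 <= v.duration_s <= 3600      # assumed length bounds
                and v.resolution[1] >= 720):        # assumed quality floor
            kept.append(v)
    return kept

def fine_filter(videos, frame_scorer, threshold=0.5):
    """Expensive pass: score each surviving video's frames for visible
    GUI content (frame_scorer is a placeholder for a learned model)."""
    return [v for v in videos if frame_scorer(v) >= threshold]
```

In this two-stage shape, the coarse stage discards the bulk of the 500 million records using metadata alone, so the costly frame-level scoring only runs on a small remainder.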

Key facts

  • Video2GUI is a fully automated framework for extracting GUI interaction trajectories from unlabeled Internet videos.
  • It uses a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos.
  • WildGUI dataset contains 12 million interaction trajectories from over 1,500 applications and websites.
  • The pipeline was applied to 500 million video metadata entries.
  • Pre-training Qwen2.5-VL and Mimo-VL on WildGUI improves GUI agent performance.
  • The research is published as arXiv:2605.14747v1.
  • The framework addresses the scarcity of large-scale training data for GUI agents.
  • Existing datasets rely on costly manual annotations and are confined to narrow domains.

Entities

Institutions

  • arXiv
