Video2GUI: Automated GUI Agent Training from Internet Videos
Researchers have introduced Video2GUI, a fully automated framework that extracts grounded GUI interaction trajectories from unlabeled videos found online, addressing the scarcity of large-scale training data for GUI agents. The framework applies a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos and converts them into structured agent trajectories. Applied to 500 million video metadata records, the pipeline produced WildGUI, a dataset of 12 million interaction trajectories spanning more than 1,500 applications and websites. Pre-training Qwen2.5-VL and Mimo-VL on WildGUI yields substantial gains in GUI agent performance, demonstrating the approach's effectiveness.
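The coarse-to-fine idea described above can be sketched as a two-stage pipeline: a cheap pass over metadata, then an expensive pass over video content. This is a minimal illustration only; the function names, keywords, and scoring threshold are assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a coarse-to-fine video filtering pipeline;
# all names and heuristics are assumptions, not Video2GUI's code.
from dataclasses import dataclass


@dataclass
class VideoMeta:
    video_id: str
    title: str
    description: str


# Coarse stage: cheap keyword heuristics over metadata only.
GUI_KEYWORDS = ("tutorial", "how to", "walkthrough", "settings", "click")


def coarse_filter(videos):
    """Keep videos whose title/description suggest a GUI tutorial."""
    kept = []
    for v in videos:
        text = f"{v.title} {v.description}".lower()
        if any(kw in text for kw in GUI_KEYWORDS):
            kept.append(v)
    return kept


def fine_filter(videos, score_fn, threshold=0.5):
    """Expensive stage: score each surviving video (e.g. with a visual
    classifier over sampled frames) and keep high-confidence tutorials."""
    return [v for v in videos if score_fn(v) >= threshold]


# Toy usage: a constant scorer stands in for a real frame-level classifier.
metas = [
    VideoMeta("a1", "How to change settings in Photoshop", "step by step"),
    VideoMeta("b2", "Cat compilation 2024", "funny cats"),
]
candidates = coarse_filter(metas)  # coarse pass drops the cat video
tutorials = fine_filter(candidates, lambda v: 0.9)
```

In practice the coarse stage makes the fine stage affordable: only a small fraction of the 500 million metadata entries would ever need frame-level analysis.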
Key facts
- Video2GUI is a fully automated framework for extracting GUI interaction trajectories from unlabeled Internet videos.
- It uses a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos.
- WildGUI dataset contains 12 million interaction trajectories from over 1,500 applications and websites.
- The pipeline was applied to 500 million video metadata entries.
- Pre-training Qwen2.5-VL and Mimo-VL on WildGUI improves GUI agent performance.
- The research is published as arXiv:2605.14747v1.
- The framework addresses the scarcity of large-scale training data for GUI agents.
- Existing datasets rely on costly manual annotations and are confined to narrow domains.
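To make "grounded interaction trajectory" concrete, one such record would pair a task description with a sequence of timestamped, screen-grounded actions. The field names below are hypothetical illustrations, not the WildGUI schema.

```python
# Hypothetical shape of one extracted trajectory; field names and values
# are illustrative assumptions, not the actual WildGUI data format.
trajectory = {
    "video_id": "a1",
    "task": "Change the canvas size in an image editor",
    "steps": [
        # Each step grounds an action to a frame timestamp and a
        # normalized (x, y) screen coordinate.
        {"frame_ts": 12.4, "action": "click", "target": "Image menu", "point": (0.08, 0.03)},
        {"frame_ts": 15.1, "action": "click", "target": "Canvas Size", "point": (0.10, 0.22)},
        {"frame_ts": 21.7, "action": "type", "target": "Width field", "text": "1920"},
    ],
}

# Steps arrive in temporal order, so downstream agent training can
# treat the sequence as a supervised action trace.
timestamps = [s["frame_ts"] for s in trajectory["steps"]]
assert timestamps == sorted(timestamps)
```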
Entities
Institutions
- arXiv