Video2GUI: Automated GUI Agent Training from Internet Videos
Researchers have introduced Video2GUI, a fully automated framework that extracts grounded GUI interaction trajectories from unlabeled videos found online, addressing the scarcity of large-scale training data for GUI agents. The framework applies a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos and converts them into structured agent trajectories. Applied to 500 million video metadata records, the pipeline produced WildGUI, a dataset of 12 million interaction trajectories spanning more than 1,500 applications and websites. Pre-training Qwen2.5-VL and Mimo-VL on WildGUI yields substantial gains in GUI agent performance, demonstrating the approach's effectiveness.
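The coarse-to-fine idea described above can be sketched as a two-stage pipeline: a cheap pass over metadata, then an expensive pass over video content. This is a minimal illustration only; the function names, keywords, and scoring threshold are assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a coarse-to-fine video filtering pipeline;
# all names and heuristics are assumptions, not Video2GUI's code.
from dataclasses import dataclass


@dataclass
class VideoMeta:
    video_id: str
    title: str
    description: str


# Coarse stage: cheap keyword heuristics over metadata only.
GUI_KEYWORDS = ("tutorial", "how to", "walkthrough", "settings", "click")


def coarse_filter(videos):
    """Keep videos whose title/description suggest a GUI tutorial."""
    kept = []
    for v in videos:
        text = f"{v.title} {v.description}".lower()
        if any(kw in text for kw in GUI_KEYWORDS):
            kept.append(v)
    return kept


def fine_filter(videos, score_fn, threshold=0.5):
    """Expensive stage: score each surviving video (e.g. with a visual
    classifier over sampled frames) and keep high-confidence tutorials."""
    return [v for v in videos if score_fn(v) >= threshold]


# Toy usage: a constant scorer stands in for a real frame-level classifier.
metas = [
    VideoMeta("a1", "How to change settings in Photoshop", "step by step"),
    VideoMeta("b2", "Cat compilation 2024", "funny cats"),
]
candidates = coarse_filter(metas)  # coarse pass drops the cat video
tutorials = fine_filter(candidates, lambda v: 0.9)
```

In practice the coarse stage makes the fine stage affordable: only a small fraction of the 500 million metadata entries would ever need frame-level analysis.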
Key facts
- Video2GUI is a fully automated framework for extracting GUI interaction trajectories from unlabeled Internet videos.
- It uses a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos.
- WildGUI dataset contains 12 million interaction trajectories from over 1,500 applications and websites.
- The pipeline was applied to 500 million video metadata entries.
- Pre-training Qwen2.5-VL and Mimo-VL on WildGUI improves GUI agent performance.
- The research is published as arXiv:2605.14747v1.
- The framework addresses the scarcity of large-scale training data for GUI agents.
- Existing datasets rely on costly manual annotations and are confined to narrow domains.
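To make "grounded interaction trajectory" concrete, one such record would pair a task description with a sequence of timestamped, screen-grounded actions. The field names below are hypothetical illustrations, not the WildGUI schema.

```python
# Hypothetical shape of one extracted trajectory; field names and values
# are illustrative assumptions, not the actual WildGUI data format.
trajectory = {
    "video_id": "a1",
    "task": "Change the canvas size in an image editor",
    "steps": [
        # Each step grounds an action to a frame timestamp and a
        # normalized (x, y) screen coordinate.
        {"frame_ts": 12.4, "action": "click", "target": "Image menu", "point": (0.08, 0.03)},
        {"frame_ts": 15.1, "action": "click", "target": "Canvas Size", "point": (0.10, 0.22)},
        {"frame_ts": 21.7, "action": "type", "target": "Width field", "text": "1920"},
    ],
}

# Steps arrive in temporal order, so downstream agent training can
# treat the sequence as a supervised action trace.
timestamps = [s["frame_ts"] for s in trajectory["steps"]]
assert timestamps == sorted(timestamps)
```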
Entities
Institutions
- arXiv