HyperTrack and GUIEvalKit: Scaling and Benchmarking VLMs for Mobile GUI Navigation

publication · 2026-05-27

A paper on arXiv (2605.27134) studies data scaling, benchmarking, and reasoning in Vision-Language Models (VLMs) for mobile GUI navigation. The authors introduce HyperTrack, a dataset of over 16,000 real-world tasks from more than 650 Chinese mobile applications, and GUIEvalKit, an open-source toolkit for standardized benchmarking of VLMs on offline GUI navigation. Experiments on HyperTrack show that reinforcement-based finetuning consistently outperforms supervised finetuning, particularly in out-of-domain settings, which points to a synergy between data scaling and reinforcement learning. With GUIEvalKit, the authors benchmark state-of-the-art VLMs and examine how interaction history and reasoning capabilities affect performance.
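
To make "offline GUI navigation" benchmarking concrete, the following is a minimal sketch of how step-level scoring against recorded trajectories could work. It is not GUIEvalKit's API, and the summary does not state which metrics the paper uses. The `Action` schema, the `tap_radius` threshold, and the two metrics (step accuracy and episode success rate) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from math import hypot
from typing import Optional


@dataclass(frozen=True)
class Action:
    """Hypothetical action schema; real benchmarks define their own."""
    kind: str                      # e.g. "tap", "type", "back"
    x: Optional[float] = None      # tap position, normalized to [0, 1]
    y: Optional[float] = None
    text: Optional[str] = None     # typed text for "type" actions


def actions_match(pred: Action, gold: Action, tap_radius: float = 0.1) -> bool:
    """Return True if the predicted action agrees with the recorded one.

    Assumed rule: kinds must be equal; taps must land within tap_radius
    of the recorded point; typed text is compared case-insensitively.
    """
    if pred.kind != gold.kind:
        return False
    if gold.kind == "tap":
        if None in (pred.x, pred.y, gold.x, gold.y):
            return False
        return hypot(pred.x - gold.x, pred.y - gold.y) <= tap_radius
    if gold.kind == "type":
        return (pred.text or "").strip().lower() == (gold.text or "").strip().lower()
    return True


def score_episodes(episodes):
    """Score a list of episodes, each a list of (predicted, recorded) pairs.

    In the offline setting the model is shown the recorded screen at every
    step, so each step is scored independently of earlier mistakes.
    """
    total_steps = correct_steps = successes = 0
    for steps in episodes:
        hits = [actions_match(pred, gold) for pred, gold in steps]
        total_steps += len(hits)
        correct_steps += sum(hits)
        # An episode counts as solved only if every step matches.
        successes += bool(hits) and all(hits)
    return {
        "step_accuracy": correct_steps / total_steps if total_steps else 0.0,
        "episode_success_rate": successes / len(episodes) if episodes else 0.0,
    }
```

Because every step is judged against a fixed recording rather than a live device, this style of evaluation is reproducible and cheap to run, but it can penalize a prediction that differs from the recorded action even when it would also have completed the task.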

Key facts

Study published on arXiv (2605.27134) about VLMs for mobile GUI navigation.
HyperTrack dataset includes over 16,000 real-world tasks across 650+ Chinese mobile apps.
GUIEvalKit is an open-source toolkit for benchmarking VLMs on offline GUI navigation.
Reinforcement-based finetuning outperforms supervised finetuning, especially out-of-domain.
Data scaling and reinforcement learning show synergistic effects.
Benchmarking includes state-of-the-art VLMs.
Analysis covers interaction history and reasoning capabilities.
Focus on Chinese mobile applications.

Entities

Institutions

arXiv

Locations

China

Sources

arXiv cs.AI — 2026-05-27