Mobile World Models Enhance GUI Agent Performance
A new study shared on arXiv (2605.10347) examines how world models can guide mobile GUI agents. The team collected and annotated mobile world-model data and trained models across four output modalities: delta text, full text, diffusion-based images, and renderable code. The models achieved state-of-the-art results on both MobileWorldBench and Code2WorldBench. Evaluated with agents on AITZ, AndroidControl, and AndroidWorld, the study reports three key findings: renderable-code reconstruction is the most effective modality for predicting the consequences of actions, world-model-generated rollouts can partially replace real environments during training, and world-model guidance at test time significantly boosts the performance of weaker agents. The research targets the long-horizon, high-risk interactions that make mobile GUI automation difficult.
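To make the renderable-code modality concrete, here is a minimal sketch of querying a world model for the post-action screen as markup. The `WorldModel` protocol, the `predict_next_screen` helper, and the prompt format are illustrative assumptions, not the paper's API.

```python
# Illustrative sketch: next-UI-state prediction in the renderable-code
# modality. All names and the prompt format are assumptions for
# illustration, not the paper's actual interface.
from typing import Protocol


class WorldModel(Protocol):
    def generate(self, prompt: str) -> str:
        """Return the model's text completion for `prompt`."""
        ...


def predict_next_screen(model: WorldModel, screen_code: str, action: str) -> str:
    """Predict the post-action screen as renderable markup.

    Because the output is code (e.g. HTML), it can be rendered and
    inspected before the action is ever executed on a real device,
    which is what makes this modality useful for anticipating the
    consequences of actions.
    """
    prompt = (
        "Current screen (HTML):\n"
        f"{screen_code}\n"
        f"Action: {action}\n"
        "Next screen (HTML):"
    )
    return model.generate(prompt)
```

A diffusion-image world model would instead return pixels, and a delta-text model only the textual changes; the study's finding is that the code form, being both structured and renderable, best supports consequence prediction.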
Key facts
- Study published on arXiv with ID 2605.10347
- World models trained across four modalities: delta text, full text, diffusion-based images, renderable code
- Models achieved state-of-the-art on MobileWorldBench and Code2WorldBench
- Evaluated on AITZ, AndroidControl, and AndroidWorld
- Renderable code reconstruction found most effective for action consequence prediction
- Generated rollouts can partially replace real environments
- Test-time guidance improves agent performance, especially for weaker agents (see the sketch after this list)
- Research addresses long-horizon and high-risk mobile interactions
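As a hedged illustration of the test-time guidance finding, the sketch below scores each candidate action by simulating its outcome with a world model before committing to it on the device. The candidate list, the `simulate` callback (e.g. the prediction helper sketched earlier), and the `score_progress` estimator are assumptions about how such guidance could work, not the paper's method.

```python
# Hypothetical sketch of world-model guidance at test time: simulate each
# candidate action's consequence, score it, and act only on the best one.
# `simulate` and `score_progress` are assumed interfaces, not the paper's.
from typing import Callable


def guided_action_selection(
    simulate: Callable[[str, str], str],     # (screen, action) -> predicted screen
    score_progress: Callable[[str], float],  # predicted screen -> task-progress score
    screen_code: str,                        # current screen as renderable markup
    candidates: list[str],                   # actions proposed by the (possibly weak) agent
) -> str:
    """Return the candidate whose simulated consequence scores highest.

    Checking consequences before acting is what lets a weaker agent avoid
    irreversible, high-risk steps (e.g. confirming a purchase) without
    touching the real environment.
    """
    best_action, best_score = candidates[0], float("-inf")
    for action in candidates:
        predicted = simulate(screen_code, action)
        score = score_progress(predicted)
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```

The same simulate-then-score idea, extended over whole trajectories, is how generated rollouts could stand in for part of the real-environment interaction during training.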
Entities
Institutions
- arXiv