Mobile World Models Enhance GUI Agent Performance
A new study shared on arXiv (2605.10347) examines how world models can guide mobile GUI agents. The team collected and annotated mobile world-model data and trained models across four output modalities: delta text, full text, diffusion-based images, and renderable code. The models achieved state-of-the-art results on both MobileWorldBench and Code2WorldBench. Evaluated with agents on AITZ, AndroidControl, and AndroidWorld, the study reports three key findings: renderable-code reconstruction is the most effective modality for predicting the consequences of actions, world-model-generated rollouts can partially replace real environments during training, and world-model guidance at test time significantly boosts the performance of weaker agents. The research targets the long-horizon, high-risk interactions that make mobile GUI automation difficult.
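To make the renderable-code modality concrete, here is a minimal sketch of querying a world model for the post-action screen as markup. The `WorldModel` protocol, the `predict_next_screen` helper, and the prompt format are illustrative assumptions, not the paper's API.

```python
# Illustrative sketch: next-UI-state prediction in the renderable-code
# modality. All names and the prompt format are assumptions for
# illustration, not the paper's actual interface.
from typing import Protocol


class WorldModel(Protocol):
    def generate(self, prompt: str) -> str:
        """Return the model's text completion for `prompt`."""
        ...


def predict_next_screen(model: WorldModel, screen_code: str, action: str) -> str:
    """Predict the post-action screen as renderable markup.

    Because the output is code (e.g. HTML), it can be rendered and
    inspected before the action is ever executed on a real device,
    which is what makes this modality useful for anticipating the
    consequences of actions.
    """
    prompt = (
        "Current screen (HTML):\n"
        f"{screen_code}\n"
        f"Action: {action}\n"
        "Next screen (HTML):"
    )
    return model.generate(prompt)
```

A diffusion-image world model would instead return pixels, and a delta-text model only the textual changes; the study's finding is that the code form, being both structured and renderable, best supports consequence prediction.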
Key facts
- Study published on arXiv with ID 2605.10347
- World models trained across four modalities: delta text, full text, diffusion-based images, renderable code
- Models achieved state-of-the-art on MobileWorldBench and Code2WorldBench
- Evaluated on AITZ, AndroidControl, and AndroidWorld
- Renderable code reconstruction found most effective for action consequence prediction
- Generated rollouts can partially replace real environments
- Test-time guidance improves agent performance, especially for weaker agents (see the sketch after this list)
- Research addresses long-horizon and high-risk mobile interactions
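As a hedged illustration of the test-time guidance finding, the sketch below scores each candidate action by simulating its outcome with a world model before committing to it on the device. The candidate list, the `simulate` callback (e.g. the prediction helper sketched earlier), and the `score_progress` estimator are assumptions about how such guidance could work, not the paper's method.

```python
# Hypothetical sketch of world-model guidance at test time: simulate each
# candidate action's consequence, score it, and act only on the best one.
# `simulate` and `score_progress` are assumed interfaces, not the paper's.
from typing import Callable


def guided_action_selection(
    simulate: Callable[[str, str], str],     # (screen, action) -> predicted screen
    score_progress: Callable[[str], float],  # predicted screen -> task-progress score
    screen_code: str,                        # current screen as renderable markup
    candidates: list[str],                   # actions proposed by the (possibly weak) agent
) -> str:
    """Return the candidate whose simulated consequence scores highest.

    Checking consequences before acting is what lets a weaker agent avoid
    irreversible, high-risk steps (e.g. confirming a purchase) without
    touching the real environment.
    """
    best_action, best_score = candidates[0], float("-inf")
    for action in candidates:
        predicted = simulate(screen_code, action)
        score = score_progress(predicted)
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```

The same simulate-then-score idea, extended over whole trajectories, is how generated rollouts could stand in for part of the real-environment interaction during training.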
Entities
Institutions
- arXiv