GUI-SD: First On-Policy Self-Distillation Framework for GUI Grounding
Researchers have introduced GUI-SD, the first on-policy self-distillation (OPSD) framework for GUI grounding, the task of mapping natural language instructions to the visual coordinates of target interface elements. Reinforcement learning methods such as GRPO require costly multiple rollouts and yield sparse signals on hard samples, whereas OPSD provides dense token-level supervision from a single rollout. GUI-SD builds a visually enriched privileged context for the teacher from a target bounding box and a Gaussian soft mask, which gives the teacher informative guidance without leaking exact coordinates. It also applies entropy-guided distillation, which adaptively weights tokens by digit, with the aim of improving both the performance and the efficiency of autonomous GUI agents.
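To make the Gaussian soft mask concrete, here is a minimal sketch of how such a mask could be computed from a target bounding box. It is an illustration under stated assumptions, not the method used in GUI-SD: the function name `gaussian_soft_mask`, the `sigma_scale` parameter, and the choice to center the Gaussian on the box center with a spread proportional to the box size are all hypothetical.

```python
import math

def gaussian_soft_mask(width, height, bbox, sigma_scale=0.5):
    """Return a height x width grid of weights in (0, 1], peaking at the bbox center.

    bbox is (x1, y1, x2, y2) in pixels. sigma_scale is a hypothetical knob:
    the standard deviation along each axis is sigma_scale times the box size
    along that axis (an assumption, not taken from the source).
    """
    x1, y1, x2, y2 = bbox
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    # Guard against zero-size boxes so the division below stays defined.
    sx = max((x2 - x1) * sigma_scale, 1e-6)
    sy = max((y2 - y1) * sigma_scale, 1e-6)
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            d2 = ((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2
            row.append(math.exp(-0.5 * d2))
        mask.append(row)
    return mask

# Toy 8x6 "screenshot" with a target box spanning x in [2, 6], y in [1, 5].
mask = gaussian_soft_mask(width=8, height=6, bbox=(2, 1, 6, 5))
```

The mask is highest at the box center and decays smoothly with distance, so blending it over a screenshot would highlight the target region softly rather than marking a hard-edged location.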
Key facts
- GUI-SD is the first OPSD framework for GUI grounding.
- GUI grounding maps natural language instructions to visual coordinates of target elements.
- Reinforcement learning methods like GRPO require expensive multiple rollouts.
- OPSD provides dense token-level supervision from a single rollout.
- GUI-SD uses a target bounding box and Gaussian soft mask for privileged context.
- Entropy-guided distillation adaptively weights tokens by digit.
- The framework does not leak exact coordinates to the teacher.
- The approach aims to improve performance on hard samples.
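
The dense token-level supervision and entropy-guided weighting listed above can be sketched as a weighted per-token divergence between teacher and student distributions. The summary does not give the actual formula, so everything here is an assumption made for illustration: the function names, the use of KL(teacher || student), and the rule that tokens where the teacher has lower entropy receive higher weight. The digit-based part of the weighting is not modeled.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def kl(p, q, eps=1e-12):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def weighted_distill_loss(teacher_probs, student_probs):
    """Entropy-weighted average of per-token KL(teacher || student).

    Assumed weighting rule (hypothetical): weight = 1 - H(teacher) / log(vocab),
    so confident teacher tokens count more and uniform ones count nothing.
    """
    max_h = math.log(len(teacher_probs[0]))
    weights = [max(0.0, 1.0 - entropy(t) / max_h) for t in teacher_probs]
    total = sum(weights)
    if total == 0.0:
        return 0.0, weights
    loss = sum(w * kl(t, s) for w, t, s in zip(weights, teacher_probs, student_probs))
    return loss / total, weights

# Two token positions over a 10-symbol digit vocabulary.
confident = [0.01] * 10
confident[3] = 0.91          # teacher is confident the digit is 3
uniform = [0.1] * 10         # teacher is maximally uncertain
teacher = [confident, uniform]
student = [uniform, uniform]

loss, weights = weighted_distill_loss(teacher, student)
```

In this toy case the first position dominates the loss because the teacher is confident there, while the uniform position contributes essentially nothing. Every token in the single rollout still receives its own supervision term, which is what distinguishes this kind of signal from a single scalar reward per rollout.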