DRS-GUI: Training-Free Dynamic Region Search for GUI Grounding
Researchers have introduced DRS-GUI, a training-free framework for GUI grounding that plugs into existing Multimodal Large Language Models (MLLMs). Inspired by how humans adjust their perceptual scope during visual search, DRS-GUI uses a lightweight UI Perceptor with three perceptual actions (Focus, Shift, and Scatter) to explore an interface and generate region proposals. An Action Planner based on Monte Carlo Tree Search (MCTS) schedules these actions dynamically and uses a region quality reward to evaluate and select proposals. The approach targets a specific difficulty for MLLM-powered GUI agents: grounding the relevant element in a high-resolution screenshot crowded with irrelevant UI components.
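To make the three perceptual actions concrete, here is a minimal sketch of how they might operate on screen regions. The summary above does not specify the geometry of Focus, Shift, or Scatter, so everything below is an assumption made for illustration: the `Region` type, the function names, the halving ratio, and the quadrant split are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Axis-aligned box in pixels: top-left corner (x, y), width w, height h."""
    x: int
    y: int
    w: int
    h: int


def focus(r: Region, ratio: float = 0.5) -> list[Region]:
    """Assumed Focus: narrow the scope to a centred sub-region."""
    w, h = int(r.w * ratio), int(r.h * ratio)
    return [Region(r.x + (r.w - w) // 2, r.y + (r.h - h) // 2, w, h)]


def shift(r: Region, dx: int, dy: int, screen: Region) -> list[Region]:
    """Assumed Shift: move the window, clamped so it stays on screen."""
    x = min(max(r.x + dx, screen.x), screen.x + screen.w - r.w)
    y = min(max(r.y + dy, screen.y), screen.y + screen.h - r.h)
    return [Region(x, y, r.w, r.h)]


def scatter(r: Region) -> list[Region]:
    """Assumed Scatter: split the scope into four quadrant proposals."""
    hw, hh = r.w // 2, r.h // 2
    return [Region(r.x + i * hw, r.y + j * hh, hw, hh)
            for j in (0, 1) for i in (0, 1)]


# Usage on a 1920x1080 screenshot (illustrative values only).
screen = Region(0, 0, 1920, 1080)
focused = focus(screen)[0]                    # centred half-size window
shifted = shift(focused, 2000, 0, screen)[0]  # oversized move gets clamped
quadrants = scatter(screen)                   # four candidate regions
```

Each action returns a list of proposals so that a planner can treat all three uniformly, whether an action yields one region or several.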
Key facts
- DRS-GUI is a training-free dynamic region search framework for GUI grounding.
- It integrates into existing Multimodal Large Language Models (MLLMs).
- The framework is inspired by how humans dynamically adjust perceptual scope.
- It introduces a lightweight UI Perceptor with three actions: Focus, Shift, and Scatter.
- An Action Planner based on Monte Carlo Tree Search (MCTS) schedules actions.
- A region quality reward evaluates and selects region proposals.
- The method targets high-resolution screenshots with irrelevant UI components.
- The work is published on arXiv with ID 2605.15542.
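The planner and reward described in the key facts can be sketched as a small tree search over regions. This is a hedged toy, not the authors' implementation: the `expand` action set, the `MIN_SIZE` cutoff, and the `search` function are hypothetical, each expanded node is scored directly instead of by a random rollout, and the reward is a stand-in (IoU against a known target box) because the actual region quality reward is not defined in this summary. In practice the target is unknown and the reward would have to come from some other signal.

```python
import math
import random

SCREEN = (0, 0, 1920, 1080)  # (x, y, w, h); illustrative screenshot size
MIN_SIZE = 8                 # assumed cutoff: stop splitting tiny regions


def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0


def expand(box, screen=SCREEN):
    """Hypothetical Focus / Shift / Scatter proposals for one region."""
    x, y, w, h = box
    hw, hh = w // 2, h // 2
    if hw < MIN_SIZE or hh < MIN_SIZE:
        return []
    sx, sy, sw, sh = screen
    focus = [(x + (w - hw) // 2, y + (h - hh) // 2, hw, hh)]
    shift = [(min(x + hw, sx + sw - w), y, w, h),   # right, clamped
             (x, min(y + hh, sy + sh - h), w, h)]   # down, clamped
    scatter = [(x + i * hw, y + j * hh, hw, hh)
               for j in (0, 1) for i in (0, 1)]
    return [b for b in focus + shift + scatter if b != box]


class Node:
    def __init__(self, box, parent=None):
        self.box, self.parent = box, parent
        self.children, self.untried = [], None
        self.visits, self.total = 0, 0.0


def search(root_box, reward, expand_fn, iterations=200, c=1.4, seed=0):
    """Simplified MCTS: UCB1 selection, one expansion, direct evaluation."""
    rng = random.Random(seed)
    root = Node(root_box)
    best_box, best_r = root_box, reward(root_box)
    for _ in range(iterations):
        node = root
        while True:  # selection: descend while the node is fully expanded
            if node.untried is None:
                node.untried = expand_fn(node.box)
                rng.shuffle(node.untried)
            if node.untried or not node.children:
                break
            parent = node
            node = max(parent.children, key=lambda ch: ch.total / ch.visits
                       + c * math.sqrt(math.log(parent.visits) / ch.visits))
        if node.untried:  # expansion: add one new child
            child = Node(node.untried.pop(), parent=node)
            node.children.append(child)
            node = child
        r = reward(node.box)  # evaluation (no rollout in this sketch)
        if r > best_r:
            best_box, best_r = node.box, r
        while node is not None:  # backpropagation
            node.visits += 1
            node.total += r
            node = node.parent
    return best_box, best_r


TARGET = (1440, 810, 240, 135)  # made-up element location for the demo
best_box, best_reward = search(SCREEN, lambda b: iou(b, TARGET), expand)
print(best_box, best_reward)
```

The search returns the highest-scoring region it evaluated, which mirrors the idea of using the reward both to guide exploration and to pick the final proposal.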
Entities
Institutions
- arXiv