AQuaUI: Training-Free Token Reduction for GUI Agents via Adaptive Quadtrees
AQuaUI is a training-free method for reducing visual tokens at inference time in GUI agent models. It exploits the non-uniform information density of screenshots: for each input it builds an adaptive quadtree and keeps one representative merged token per leaf. This lowers the visual token count while preserving spatial positions, and it requires neither additional training nor attention-based compression. The method targets high-resolution GUI screenshots, in which large regions carry little information while text and icons must be kept at high fidelity. AQuaUI is proposed as an effective solution for LMM-based GUI agents that take a screenshot as input at every step.
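The quadtree idea can be illustrated with a minimal sketch. It is not the AQuaUI implementation: the split criterion (maximum absolute deviation from the block mean), the `threshold` parameter, mean-pooling as the merge rule, the square power-of-two grid, and the names `quadtree_reduce`, `_mean_token`, and `_spread` are all assumptions made for illustration. The sketch only shows how uniform regions collapse into one token per leaf while detailed regions stay at full resolution, with each leaf keeping its position and size.

```python
from typing import List, Tuple

Token = List[float]
# A leaf records where it sits in the patch grid and the merged token it keeps:
# (top row, left column, side length in patches, merged token).
Leaf = Tuple[int, int, int, Token]


def _mean_token(grid: List[List[Token]], r: int, c: int, size: int) -> Token:
    """Average all patch tokens inside the size x size block at (r, c)."""
    dim = len(grid[r][c])
    total = [0.0] * dim
    for i in range(r, r + size):
        for j in range(c, c + size):
            for d in range(dim):
                total[d] += grid[i][j][d]
    n = size * size
    return [t / n for t in total]


def _spread(grid: List[List[Token]], r: int, c: int, size: int, mean: Token) -> float:
    """Largest absolute deviation of any patch feature from the block mean.

    Hypothetical stand-in for an information-density measure.
    """
    worst = 0.0
    for i in range(r, r + size):
        for j in range(c, c + size):
            for d, m in enumerate(mean):
                worst = max(worst, abs(grid[i][j][d] - m))
    return worst


def quadtree_reduce(grid: List[List[Token]], threshold: float) -> List[Leaf]:
    """Split the patch grid adaptively and return one merged token per leaf."""
    n = len(grid)
    if n == 0 or n & (n - 1) or any(len(row) != n for row in grid):
        raise ValueError("expects a square grid whose side is a power of two")

    leaves: List[Leaf] = []

    def visit(r: int, c: int, size: int) -> None:
        mean = _mean_token(grid, r, c, size)
        # Stop at single patches or at blocks that are uniform enough.
        if size == 1 or _spread(grid, r, c, size, mean) <= threshold:
            leaves.append((r, c, size, mean))
            return
        half = size // 2
        for dr in (0, half):
            for dc in (0, half):
                visit(r + dr, c + dc, half)

    visit(0, 0, n)
    return leaves


# Usage: a 4x4 grid of 1-dimensional tokens, blank except for one "detailed" patch.
grid = [[[0.0] for _ in range(4)] for _ in range(4)]
grid[0][0] = [1.0]
leaves = quadtree_reduce(grid, threshold=0.1)
print(len(leaves))  # 7 leaves instead of 16 patch tokens
```

In this toy input, the three blank quadrants each become a single leaf, while the quadrant containing the distinct patch is split down to individual patches. Because every leaf stores its row, column, and size, the reduced token set still covers the whole grid and each token can be mapped back to its screen region.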
Key facts
- AQuaUI is a training-free inference-time token reduction method
- It uses adaptive quadtrees on screenshot inputs
- One representative merged token is kept per leaf of the quadtree
- It preserves the spatial positions of the retained tokens
- It addresses non-uniform information density in GUI screenshots
- No additional training or attention-based compression is required
- Targets LMM-based GUI agent models
Entities
Institutions
- arXiv