PANDO: Efficient Multimodal AI Agents via Online Skill Distillation
PANDO is a framework that makes multimodal web agents more efficient by distilling skills online, within a single rollout. Analyzing trajectories from VisualWebArena, the researchers identified three main inefficiencies: repeat-action loops, hidden discovery costs, and low prompt-cache reuse. PANDO maintains a structured Skill Library and applies progress reflection, confidence-based skill demotion, hierarchical routing, visual compression, and cache-aware prompting. Across 910 VisualWebArena tasks, PANDO achieved a 58.3% success rate, ahead of SGV (54.0%) and WALT (45.2%), while using 58% fewer tokens than SGV and 61% fewer than WALT. The paper is available on arXiv under ID 2605.24785.
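To make the idea of confidence-based skill demotion concrete, here is a minimal Python sketch. It is not PANDO's implementation: the `Skill` and `SkillLibrary` classes, the update rule, the `rate` parameter, and the `demote_below` threshold are all hypothetical choices made for illustration. The sketch only shows the general pattern of tracking a per-skill confidence score and withdrawing skills that keep failing.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """A reusable action recipe with a running confidence score (hypothetical schema)."""
    name: str
    steps: list[str]
    confidence: float = 0.5
    active: bool = True


@dataclass
class SkillLibrary:
    """Toy skill store; the threshold and update rate are illustrative assumptions."""
    demote_below: float = 0.2
    rate: float = 0.5
    skills: dict[str, Skill] = field(default_factory=dict)

    def add(self, skill: Skill) -> None:
        self.skills[skill.name] = skill

    def record_outcome(self, name: str, succeeded: bool) -> None:
        """Move confidence toward 1.0 on success or 0.0 on failure, then demote if low."""
        skill = self.skills[name]
        target = 1.0 if succeeded else 0.0
        skill.confidence += self.rate * (target - skill.confidence)
        if skill.confidence < self.demote_below:
            skill.active = False  # demoted: no longer offered to the agent

    def candidates(self) -> list[Skill]:
        """Return active skills, highest confidence first."""
        live = [s for s in self.skills.values() if s.active]
        return sorted(live, key=lambda s: s.confidence, reverse=True)
```

With these assumed defaults, a skill starting at confidence 0.5 stays active after one failure (0.25) and is demoted after a second (0.125), so a single unlucky step does not discard a skill that usually works.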
Key facts
- PANDO is a single-rollout online skill-distillation framework for multimodal web agents.
- Three inefficiencies identified: repeat-action loops, hidden discovery costs, low prompt-cache reuse.
- PANDO uses a structured Skill Library with progress reflection, confidence-based skill demotion, hierarchical routing, visual compression, and cache-aware prompting.
- Tested on 910 VisualWebArena tasks.
- Achieved 58.3% success rate.
- Outperformed SGV (54.0%) and WALT (45.2%).
- Used 58% fewer tokens than SGV and 61% fewer than WALT.
- Paper available on arXiv: 2605.24785.
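The reported figures can be restated as margins. The short Python check below uses only the numbers listed above; the function names are illustrative, and the token figures are read as "PANDO uses this percentage fewer tokens than the baseline".

```python
# Figures as reported in the summary.
success = {"PANDO": 58.3, "SGV": 54.0, "WALT": 45.2}  # success rate, percent
token_savings = {"SGV": 58, "WALT": 61}               # percent fewer tokens used by PANDO


def margin_points(baseline: str) -> float:
    """PANDO's success-rate lead over a baseline, in percentage points."""
    return round(success["PANDO"] - success[baseline], 1)


def relative_token_use(baseline: str) -> float:
    """PANDO's token use as a fraction of the baseline's token use."""
    return round(1 - token_savings[baseline] / 100, 2)


for name in ("SGV", "WALT"):
    print(name, margin_points(name), relative_token_use(name))
```

Under that reading, PANDO leads SGV by 4.3 percentage points and WALT by 13.1, while consuming about 42% of SGV's tokens and 39% of WALT's.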
Entities
Institutions
- arXiv