PANDO: Efficient Multimodal AI Agents via Online Skill Distillation

ai-technology · 2026-05-26

Researchers have introduced PANDO, a framework that makes multimodal web agents more efficient by distilling skills online within a single rollout. Analyzing trajectories from VisualWebArena, they identified three main inefficiencies: repeat-action loops, hidden discovery costs, and low prompt-cache reuse. PANDO maintains a structured Skill Library and combines progress reflection, confidence-based skill demotion, hierarchical routing, visual compression, and cache-aware prompting. Across 910 VisualWebArena tasks, PANDO reached a 58.3% success rate, ahead of SGV (54.0%) and WALT (45.2%), while using 58% fewer tokens than SGV and 61% fewer than WALT. The paper is available on arXiv under ID 2605.24785.

Key facts

PANDO is a single-rollout online skill-distillation framework for multimodal web agents.
Three inefficiencies identified: repeat-action loops, hidden discovery costs, low prompt-cache reuse.
PANDO uses a structured Skill Library with progress reflection, confidence-based skill demotion, hierarchical routing, visual compression, and cache-aware prompting.
Tested on 910 VisualWebArena tasks.
Achieved 58.3% success rate.
Outperformed SGV (54.0%) and WALT (45.2%).
Used 58% fewer tokens than SGV and 61% fewer than WALT.
Paper available on arXiv: 2605.24785.
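
To make the first inefficiency concrete, here is a minimal sketch of what flagging a repeat-action loop in an agent trajectory could look like. It is an illustration only, not PANDO's method: the (action type, target) encoding, the function name find_repeat_loops, and the min_repeats threshold are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical action encoding: (action type, target element).
Action = Tuple[str, str]


def find_repeat_loops(actions: List[Action], min_repeats: int = 3) -> List[Tuple[int, int]]:
    """Return (start_index, run_length) for every run of identical
    consecutive actions that is at least min_repeats long."""
    loops = []
    start = 0
    for i in range(1, len(actions) + 1):
        # Close the current run at the end of the list or when the action changes.
        if i == len(actions) or actions[i] != actions[start]:
            run_length = i - start
            if run_length >= min_repeats:
                loops.append((start, run_length))
            start = i
    return loops


# Illustrative trajectory (made up for this example, not benchmark data).
trajectory = [
    ("click", "search_box"),
    ("type", "search_box"),
    ("click", "next_page"),
    ("click", "next_page"),
    ("click", "next_page"),
    ("click", "next_page"),
    ("click", "item_3"),
]

print(find_repeat_loops(trajectory))  # [(2, 4)]
```

Here the four consecutive "next_page" clicks starting at step 2 are reported as one loop. A real agent would likely need a looser notion of repetition (for example, cycles of several actions), which this sketch does not cover.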
