ARTFEED — Contemporary Art Intelligence

PANDO: Efficient Multimodal AI Agents via Online Skill Distillation

ai-technology · 2026-05-26

Researchers have introduced PANDO, a framework that makes multimodal web agents more efficient by distilling skills online, within a single rollout. Analyzing trajectories from VisualWebArena, they identified three main inefficiencies: repeat-action loops, hidden discovery costs, and low prompt-cache reuse. PANDO maintains a structured Skill Library and applies progress reflection, confidence-based skill demotion, hierarchical routing, visual compression, and cache-aware prompting. Across 910 VisualWebArena tasks, PANDO achieved a 58.3% success rate, ahead of SGV (54.0%) and WALT (45.2%), while using 58% fewer tokens than SGV and 61% fewer than WALT. The paper is available on arXiv (ID 2605.24785).

Key facts

  • PANDO is a single-rollout online skill-distillation framework for multimodal web agents.
  • Three inefficiencies identified: repeat-action loops, hidden discovery costs, low prompt-cache reuse.
  • PANDO uses a structured Skill Library with progress reflection, confidence-based skill demotion, hierarchical routing, visual compression, and cache-aware prompting.
  • Tested on 910 VisualWebArena tasks.
  • Achieved 58.3% success rate.
  • Outperformed SGV (54.0%) and WALT (45.2%).
  • Used 58% fewer tokens than SGV and 61% fewer than WALT.
  • Paper available on arXiv: 2605.24785.
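
To make two of the ideas above concrete, here is a minimal Python sketch of confidence-based skill demotion and repeat-action loop detection. It is an illustration under stated assumptions, not PANDO's actual implementation: the names `Skill`, `SkillLibrary`, `is_repeat_loop`, the smoothed success-rate confidence score, the 0.3 demotion threshold, and the 3-action window are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """A reusable skill with simple outcome counters (hypothetical structure)."""
    name: str
    successes: int = 0
    failures: int = 0
    active: bool = True

    @property
    def confidence(self) -> float:
        # Laplace-smoothed success rate, so an untried skill starts at 0.5.
        # This scoring rule is an assumption chosen for illustration.
        return (self.successes + 1) / (self.successes + self.failures + 2)


@dataclass
class SkillLibrary:
    """Tracks skills and demotes those whose confidence drops too low."""
    demote_below: float = 0.3  # hypothetical threshold
    skills: dict = field(default_factory=dict)

    def add(self, name: str) -> None:
        self.skills.setdefault(name, Skill(name))

    def record(self, name: str, succeeded: bool) -> None:
        skill = self.skills[name]
        if succeeded:
            skill.successes += 1
        else:
            skill.failures += 1
        # Re-evaluate after every outcome: a skill is active only while
        # its confidence stays at or above the threshold.
        skill.active = skill.confidence >= self.demote_below

    def active_skills(self) -> list:
        return [s.name for s in self.skills.values() if s.active]


def is_repeat_loop(actions, window: int = 3) -> bool:
    """Return True if the last `window` actions are all identical."""
    recent = list(actions)[-window:]
    return len(recent) == window and len(set(recent)) == 1
```

In this sketch, a skill that fails twice in a row with no successes falls to a confidence of 0.25 and is dropped from `active_skills()`, while an agent loop could call `is_repeat_loop` on its action history to decide when to trigger a reflection step.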

Entities

Institutions

  • arXiv
