ShopGym: A Framework for Realistic E-Commerce Web Agent Benchmarking
A team of researchers has introduced ShopGym, a framework for realistic simulation and scalable benchmarking of e-commerce web agents. It targets a tradeoff in existing evaluation methods: live storefronts are realistic but non-stationary, difficult to inspect, and irreproducible, while hand-built sandbox benchmarks offer control but lack variety. The paper frames the core bottleneck as methodological, namely the lack of a scalable way to construct evaluation settings. ShopGym's simulation layer, ShopArena, converts live e-commerce sites into controllable, inspectable, and reproducible environments, so that diverse evaluation settings can be built without giving up realism. The stated aim is to support more rigorous scientific comparison of e-commerce web agents. The paper was published on arXiv with the identifier 2605.16116.
Key facts
- ShopGym is a framework for realistic simulation and scalable benchmarking of e-commerce web agents.
- It addresses the tradeoff between live storefronts and hand-built sandbox benchmarks.
- ShopArena is the simulation layer that converts live e-commerce sites into controllable environments.
- Existing methodologies force a tradeoff between realism and reproducibility.
- ShopGym enables diverse, controllable, inspectable, and reproducible evaluation settings.
- The paper was published on arXiv with identifier 2605.16116.
- The core bottleneck identified is methodological: the lack of a scalable way to construct evaluation settings.
- Live storefronts are non-stationary, difficult to inspect, and irreproducible.
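To make the terms "controllable", "inspectable", and "reproducible" concrete, here is a minimal toy sketch in Python. It is not ShopGym's or ShopArena's actual API, which this summary does not describe; the class `SnapshotStorefront`, its `reset`, `step`, and `fingerprint` methods, and the sample catalog are all hypothetical names chosen for illustration. The sketch assumes a storefront has already been captured as a fixed catalog snapshot, and shows that replaying the same actions from that snapshot yields the same final state, which a live, changing storefront cannot guarantee.

```python
import copy
import hashlib
import json


class SnapshotStorefront:
    """Toy storefront frozen from a fixed catalog snapshot (hypothetical, not ShopArena)."""

    def __init__(self, catalog):
        # Keep a private copy of the snapshot so later episodes cannot alter it.
        self._snapshot = copy.deepcopy(catalog)
        self.reset()

    def reset(self):
        # Controllable: every episode starts from the exact same state.
        self.catalog = copy.deepcopy(self._snapshot)
        self.cart = []
        self.log = []  # Inspectable: a trace of every action and its outcome.

    def step(self, action, sku):
        # Only one action is modelled here: adding an in-stock item to the cart.
        in_stock = self.catalog.get(sku, {}).get("stock", 0) > 0
        ok = action == "add_to_cart" and in_stock
        if ok:
            self.catalog[sku]["stock"] -= 1
            self.cart.append(sku)
        self.log.append((action, sku, ok))
        return ok

    def fingerprint(self):
        # Hash of the full state, used to compare the end states of two runs.
        state = json.dumps({"catalog": self.catalog, "cart": self.cart}, sort_keys=True)
        return hashlib.sha256(state.encode()).hexdigest()


def run_episode(env, actions):
    """Reset the environment, replay the actions, and return the final state hash."""
    env.reset()
    for action, sku in actions:
        env.step(action, sku)
    return env.fingerprint()


# Hypothetical snapshot: one item with a single unit in stock, one sold out.
catalog = {"sku-1": {"price": 19.99, "stock": 1}, "sku-2": {"price": 5.0, "stock": 0}}
env = SnapshotStorefront(catalog)
actions = [("add_to_cart", "sku-1"), ("add_to_cart", "sku-1"), ("add_to_cart", "sku-2")]

first = run_episode(env, actions)
second = run_episode(env, actions)
print(first == second)  # Reproducible: identical actions give an identical end state.
print(env.log)
```

On a live storefront, stock and prices can change between runs, so the same comparison would not be guaranteed to hold; that is the gap the snapshot-style setup in this sketch is meant to illustrate.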
Entities
Institutions
- arXiv