New Amazon-Bench Benchmark Addresses Gaps in Web Agent Evaluation for E-commerce

ai-technology · 2026-04-22

A new benchmark, Amazon-Bench, has been proposed to address two limitations in how web agents are currently evaluated on e-commerce platforms. First, existing benchmarks focus mainly on product search tasks, such as finding a specific item, and so fail to capture the broader range of functionality on real-world platforms like Amazon, including account management and gift card handling. Second, current evaluation methods typically check only whether an agent completes the user's query and ignore the risks that arise in practical use: a web agent can make unintended changes that harm a user's account, such as purchasing the wrong item, deleting a saved address, or misconfiguring auto-reload settings. Amazon-Bench aims to generate user queries that better reflect the full functionality of e-commerce platforms and to incorporate risk assessment into the evaluation framework. The research is documented in arXiv:2508.15832v2 under the announcement type replace-cross.

Key facts

A new benchmark called Amazon-Bench has been proposed for evaluating web agents
Current e-commerce benchmarks primarily focus on product search tasks
Existing benchmarks fail to capture broader platform functionalities like account management
Current evaluations ignore potential risks from unintended agent actions
Web agents can negatively impact user accounts through incorrect purchases or settings
The benchmark addresses gaps in assessing real-world e-commerce platform operations
The research is documented as arXiv:2508.15832v2
The announcement type is replace-cross
