DRIP-R: Benchmarking LLMs on Real-World Retail Policy Ambiguity

ai-technology · 2026-05-11

DRIP-R is a new benchmark that assesses how LLM-based agents make decisions under real-world policy ambiguity in retail settings. Unlike existing agent benchmarks, which assume unambiguous, well-specified policies, DRIP-R is built on return scenarios in which the policy can reasonably be interpreted in more than one way and no single resolution is correct. It combines curated policy-ambiguous scenarios, realistic customer personas, a full conversational simulation with tool calling, and a multi-judge evaluation covering policy adherence, dialogue quality, behavioral alignment, and resolution quality. In the reported experiments, frontier models often interpret the same policies differently, which points to a gap in how such agents are currently evaluated.

Key facts

DRIP-R is a benchmark for decision-making under real-world policy ambiguity in retail.
It exploits real-world retail policy ambiguities with no single correct resolution.
Includes curated return scenarios, customer personas, and conversational simulation.
Multi-judge evaluation covers policy adherence, dialogue quality, behavioral alignment, and resolution quality.
Frontier models fundamentally disagree on how to interpret identical policies.
Existing agent benchmarks assume unambiguous, well-specified policies.
LLM-based agents are increasingly deployed for routine retail tasks.
The benchmark addresses an evaluation gap: how agents behave when a policy has no single correct reading.
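
To make the multi-judge evaluation and the model-disagreement finding more concrete, here is a minimal Python sketch. It is not DRIP-R's implementation: only the four dimension names come from the summary above. The `JudgeScore` record, the 0.0 to 1.0 score scale, averaging across judges, and the agreement measure are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean

# The four dimensions are named in the summary; the identifiers are hypothetical.
DIMENSIONS = (
    "policy_adherence",
    "dialogue_quality",
    "behavioral_alignment",
    "resolution_quality",
)


@dataclass(frozen=True)
class JudgeScore:
    """One judge's score for one dimension of one conversation (hypothetical record)."""
    judge: str
    dimension: str
    score: float  # assumed scale: 0.0 (worst) to 1.0 (best)


def aggregate_by_dimension(scores):
    """Average the judges' scores separately for each dimension.

    Plain averaging is an assumption; the benchmark may combine judges differently.
    Dimensions with no scores are omitted from the result.
    """
    result = {}
    for dimension in DIMENSIONS:
        values = [s.score for s in scores if s.dimension == dimension]
        if values:
            result[dimension] = mean(values)
    return result


def resolution_agreement(resolutions):
    """Return the fraction of models that chose the most common resolution.

    `resolutions` maps a model name to the outcome it reached on one scenario.
    A value of 1.0 means every model resolved the scenario the same way.
    """
    if not resolutions:
        raise ValueError("need at least one resolution")
    counts = Counter(resolutions.values())
    top_count = counts.most_common(1)[0][1]
    return top_count / len(resolutions)
```

With this sketch, four hypothetical models that resolve one return scenario as refund, refund, store credit, and denial would get an agreement of 0.5. Because the scenarios have no single correct resolution, a low value here signals divergent policy readings rather than an error by any one model.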

Sources

arXiv cs.AI — 2026-05-11