DRIP-R: Benchmarking LLMs on Real-World Retail Policy Ambiguity
DRIP-R is a new benchmark that assesses how LLM-based agents make decisions under real-world policy ambiguity in retail settings. Unlike existing agent benchmarks, which assume unambiguous, well-specified rules, DRIP-R is built around return scenarios in which the policy admits several defensible interpretations and no single resolution is correct. It combines curated policy-ambiguous scenarios, realistic customer personas, a full conversational simulation with tool calling, and a multi-judge evaluation covering policy adherence, dialogue quality, behavioral alignment, and resolution quality. Experiments show that frontier models often interpret the same policy differently, which points to a significant gap in current evaluation.
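As an illustration of how scores from several judges might be combined across the four evaluation dimensions, here is a minimal sketch. The summary does not describe the actual scoring scale or aggregation rule, so the 1-5 scale, the dimension keys, and the per-dimension mean are all assumptions made for this example.

```python
from statistics import mean

# Hypothetical identifiers for the four evaluation dimensions.
DIMENSIONS = (
    "policy_adherence",
    "dialogue_quality",
    "behavioral_alignment",
    "resolution_quality",
)


def aggregate_judges(judge_scores):
    """Average each dimension over all judges.

    judge_scores: list of dicts, one per judge, mapping dimension -> score.
    The plain mean is an assumed aggregation rule, not the benchmark's own.
    """
    if not judge_scores:
        raise ValueError("need at least one judge")
    return {dim: mean(js[dim] for js in judge_scores) for dim in DIMENSIONS}


# Two hypothetical judges scoring one conversation on an assumed 1-5 scale.
scores = [
    {"policy_adherence": 4, "dialogue_quality": 5,
     "behavioral_alignment": 3, "resolution_quality": 4},
    {"policy_adherence": 2, "dialogue_quality": 5,
     "behavioral_alignment": 3, "resolution_quality": 3},
]
print(aggregate_judges(scores))
```

Keeping the dimensions separate, rather than collapsing them into one number, makes it visible when judges agree on dialogue quality but split on policy adherence, which is the kind of case an ambiguous policy is likely to produce.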
Key facts
- DRIP-R is a benchmark for decision-making under real-world policy ambiguity in retail.
- It builds on real-world retail policy ambiguities that have no single correct resolution.
- It includes curated return scenarios, customer personas, and a conversational simulation with tool calling.
- Multi-judge evaluation covers policy adherence, dialogue quality, behavioral alignment, and resolution quality.
- Frontier models fundamentally disagree on how to interpret identical policies.
- Existing agent benchmarks assume unambiguous, well-specified policies.
- LLM-based agents are increasingly deployed for routine retail tasks.
- The benchmark addresses a critical evaluation gap.
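One way to quantify the disagreement between models noted above is a pairwise disagreement rate over their final decisions on the same scenarios. The sketch below is only an illustration: the model names, the decision labels ("approve", "deny", "escalate"), and the metric itself are assumptions, not something the summary attributes to DRIP-R.

```python
from itertools import combinations


def pairwise_disagreement(decisions):
    """Fraction of (model pair, scenario) combinations with differing decisions.

    decisions: dict mapping model name -> list of decision labels,
    where index i in every list refers to the same scenario.
    """
    models = list(decisions)
    if len(models) < 2:
        raise ValueError("need at least two models to compare")
    n = len(decisions[models[0]])
    if n == 0 or any(len(decisions[m]) != n for m in models):
        raise ValueError("all models need the same non-empty scenario list")
    pairs = list(combinations(models, 2))
    differing = sum(
        decisions[a][i] != decisions[b][i]
        for a, b in pairs
        for i in range(n)
    )
    return differing / (len(pairs) * n)


# Hypothetical decisions from three models on three return scenarios.
example = {
    "model_a": ["approve", "deny", "escalate"],
    "model_b": ["approve", "approve", "escalate"],
    "model_c": ["deny", "deny", "escalate"],
}
rate = pairwise_disagreement(example)
print(f"pairwise disagreement rate: {rate:.3f}")
```

A rate of 0 means every model reached the same decision on every scenario; values near 1 mean the models almost never agree.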