Ambig-DS Benchmark Exposes Task-Framing Ambiguity in Data-Science Agents
Researchers have introduced Ambig-DS, a benchmark that evaluates how data-science agents handle task-framing ambiguity. As agents shift from co-pilot to auto-pilot roles, they may silently commit to a plausible but unintended task framing, producing clean-looking artifacts that conceal the misframing. Ambig-DS comprises two diagnostic suites: Ambig-DS-Target (51 tasks built on DSBench, a tabular-modeling benchmark) for prediction-target ambiguity, and Ambig-DS-Objective (61 tasks built on MLE-bench, a Kaggle-style ML-competition benchmark) for evaluation-objective ambiguity. Each task pairs an original, fully specified version with an ambiguous variant created through controlled edits, and a human-and-LLM verification pipeline confirms that each edit admits multiple plausible interpretations. Scoring reuses each source benchmark's original evaluator, so the benchmark measures whether agents recognize underspecified tasks rather than merely whether their pipelines run.
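To make the paired-variant design concrete, here is a minimal sketch of what one benchmark entry could look like. The class name, fields, and example contents are illustrative assumptions, not the actual Ambig-DS schema.

```python
from dataclasses import dataclass, field

@dataclass
class AmbigDSTask:
    """One entry: a fully specified task paired with its ambiguous twin.

    Hypothetical schema for illustration; field names are not from the paper.
    """
    task_id: str
    suite: str             # "target" (DSBench-based) or "objective" (MLE-bench-based)
    original_prompt: str   # fully specified task description
    ambiguous_prompt: str  # controlled edit that admits multiple framings
    plausible_framings: list[str] = field(default_factory=list)  # confirmed by the human-and-LLM pipeline

# Invented example for the prediction-target suite
task = AmbigDSTask(
    task_id="target-007",
    suite="target",
    original_prompt="Predict the `churned` column for each row of the test split.",
    ambiguous_prompt="Build a model that predicts customer outcomes.",
    plausible_framings=["churn", "lifetime value", "time to next purchase"],
)
```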
Key facts
- Ambig-DS addresses silent misframing in data-science agents
- Two diagnostic suites: Ambig-DS-Target (51 tasks) and Ambig-DS-Objective (61 tasks)
- Built on DSBench and MLE-bench, respectively
- Each task pairs a fully specified original with an ambiguous variant
- Human-and-LLM verification pipeline confirms ambiguity
- Scoring reuses the original evaluators from the source benchmarks (see the sketch after this list)
- Measures whether agents recognize underspecification rather than merely whether pipelines run
- Published on arXiv as 2605.09698v1
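Because scoring reuses the source benchmarks' own evaluators, one natural diagnostic is to compare an agent's score on the two variants of a task against the same ground truth. The sketch below assumes the hypothetical `AmbigDSTask` shape above; `run_agent` and `evaluate` are stand-in hooks, not Ambig-DS APIs, and the score-gap heuristic assumes higher scores are better.

```python
def diagnose_silent_misframing(task, run_agent, evaluate, tolerance=0.05):
    """Flag tasks where the agent likely committed to an unintended framing.

    run_agent(prompt) -> submission artifact (hypothetical hook)
    evaluate(submission) -> float score from the source benchmark's
    original evaluator (DSBench or MLE-bench); higher is assumed better.
    """
    score_original = evaluate(run_agent(task.original_prompt))
    score_ambiguous = evaluate(run_agent(task.ambiguous_prompt))

    # Both runs may yield clean, runnable pipelines; only the score gap
    # against the original evaluator exposes a silently misframed task.
    gap = score_original - score_ambiguous
    return {
        "task_id": task.task_id,
        "score_original": score_original,
        "score_ambiguous": score_ambiguous,
        "silent_misframing_suspected": gap > tolerance,
    }
```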
Entities
Institutions
- arXiv