Ambig-DS Benchmark Exposes Task-Framing Ambiguity in Data-Science Agents
Researchers have introduced Ambig-DS, a benchmark that evaluates how data-science agents handle task-framing ambiguity. As agents shift from co-pilot to auto-pilot roles, they may silently commit to a plausible but unintended task framing, producing clean-looking artifacts that conceal the misframing. Ambig-DS comprises two diagnostic suites: Ambig-DS-Target (51 tasks built on DSBench, a tabular-modeling benchmark) for prediction-target ambiguity, and Ambig-DS-Objective (61 tasks built on MLE-bench, a Kaggle-style ML-competition benchmark) for evaluation-objective ambiguity. Each task pairs an original, fully specified version with an ambiguous variant created through controlled edits, and a human-and-LLM verification pipeline confirms that each edit admits multiple plausible interpretations. Scoring reuses each source benchmark's original evaluator, so the benchmark measures whether agents recognize underspecified tasks rather than merely whether their pipelines run.
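To make the paired-variant design concrete, here is a minimal sketch of what one benchmark entry could look like. The class name, fields, and example contents are illustrative assumptions, not the actual Ambig-DS schema.

```python
from dataclasses import dataclass, field

@dataclass
class AmbigDSTask:
    """One entry: a fully specified task paired with its ambiguous twin.

    Hypothetical schema for illustration; field names are not from the paper.
    """
    task_id: str
    suite: str             # "target" (DSBench-based) or "objective" (MLE-bench-based)
    original_prompt: str   # fully specified task description
    ambiguous_prompt: str  # controlled edit that admits multiple framings
    plausible_framings: list[str] = field(default_factory=list)  # confirmed by the human-and-LLM pipeline

# Invented example for the prediction-target suite
task = AmbigDSTask(
    task_id="target-007",
    suite="target",
    original_prompt="Predict the `churned` column for each row of the test split.",
    ambiguous_prompt="Build a model that predicts customer outcomes.",
    plausible_framings=["churn", "lifetime value", "time to next purchase"],
)
```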
Key facts
- Ambig-DS addresses silent misframing in data-science agents
- Two diagnostic suites: Ambig-DS-Target (51 tasks) and Ambig-DS-Objective (61 tasks)
- Built on DSBench and MLE-bench, respectively
- Each task pairs a fully specified original with an ambiguous variant
- Human-and-LLM verification pipeline confirms ambiguity
- Scoring reuses the original evaluators from the source benchmarks (see the sketch after this list)
- Measures whether agents recognize underspecification rather than merely whether pipelines run
- Published on arXiv as 2605.09698v1
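Because scoring reuses the source benchmarks' own evaluators, one natural diagnostic is to compare an agent's score on the two variants of a task against the same ground truth. The sketch below assumes the hypothetical `AmbigDSTask` shape above; `run_agent` and `evaluate` are stand-in hooks, not Ambig-DS APIs, and the score-gap heuristic assumes higher scores are better.

```python
def diagnose_silent_misframing(task, run_agent, evaluate, tolerance=0.05):
    """Flag tasks where the agent likely committed to an unintended framing.

    run_agent(prompt) -> submission artifact (hypothetical hook)
    evaluate(submission) -> float score from the source benchmark's
    original evaluator (DSBench or MLE-bench); higher is assumed better.
    """
    score_original = evaluate(run_agent(task.original_prompt))
    score_ambiguous = evaluate(run_agent(task.ambiguous_prompt))

    # Both runs may yield clean, runnable pipelines; only the score gap
    # against the original evaluator exposes a silently misframed task.
    gap = score_original - score_ambiguous
    return {
        "task_id": task.task_id,
        "score_original": score_original,
        "score_ambiguous": score_ambiguous,
        "silent_misframing_suspected": gap > tolerance,
    }
```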
Entities
Institutions
- arXiv