D3-Gym: A Benchmark for Verifiable Scientific Data-Driven Discovery
D3-Gym is a new dataset intended to fill the gap in verifiable environments for scientific data-driven discovery tasks. It is notable as the first such dataset to be constructed automatically, comprising 565 tasks drawn from 239 authentic scientific repositories across four disciplines. Each task provides a natural-language instruction, an executable environment with the necessary dependencies, previews of the input datasets and expected artifacts, a reference code solution, and an automatically generated evaluation script. These evaluation scripts agree with human-annotated gold standards on 87.5% of tasks, suggesting they reliably capture domain-specific evaluation logic. By offering verifiable benchmarks, the dataset aims to advance language models and agents for scientific discovery.
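The per-task components described above can be pictured as a simple record. The sketch below is purely illustrative: the field names, paths, and class name are assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class D3GymTask:
    """Hypothetical record for one D3-Gym task (all field names are illustrative)."""
    instruction: str         # natural-language task description
    environment: str         # spec of the executable environment with dependencies
    dataset_preview: str     # preview of the input dataset
    artifact_preview: str    # preview of the expected output artifacts
    reference_solution: str  # path to the reference code solution
    eval_script: str         # path to the automatically generated evaluation script

# Example instance with invented placeholder values.
task = D3GymTask(
    instruction="Fit a model to the provided measurements and report RMSE.",
    environment="env/requirements.txt",
    dataset_preview="data/measurements.csv (first 5 rows)",
    artifact_preview="results/metrics.json (expected keys)",
    reference_solution="solution/run.py",
    eval_script="eval/check.py",
)
```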
Key facts
- D3-Gym is the first automatically constructed dataset with verifiable environments for scientific data-driven discovery.
- The dataset comprises 565 tasks from 239 real scientific repositories across four disciplines.
- Each task includes a natural-language instruction, an executable environment, previews of the input datasets and artifacts, a reference code solution, and an evaluation script.
- Evaluation scripts achieve 87.5% agreement with human-annotated gold standards.
- The dataset addresses the absence of verifiable environments for scientific tasks.
- It is designed to advance language models and agents in data-driven discovery.
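The 87.5% agreement figure can be read as the fraction of tasks on which the automatic evaluation script and the human-annotated gold standard reach the same verdict. A minimal sketch of that computation, with invented example verdicts (the source does not specify the verdict format):

```python
def agreement_rate(auto_verdicts, gold_verdicts):
    """Fraction of tasks where the automatic evaluation script's verdict
    matches the human-annotated gold verdict."""
    assert len(auto_verdicts) == len(gold_verdicts)
    matches = sum(a == g for a, g in zip(auto_verdicts, gold_verdicts))
    return matches / len(auto_verdicts)

# Illustrative verdicts only: 7 of 8 pairs match, i.e. 87.5% agreement.
auto = [True, True, False, True, False, True, True, True]
gold = [True, True, False, True, True, True, True, True]
print(agreement_rate(auto, gold))  # → 0.875
```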