D3-Gym: A Benchmark for Verifiable Scientific Data-Driven Discovery
D3-Gym is a new dataset intended to fill the gap in verifiable environments for scientific data-driven discovery tasks. It is notable as the first such dataset to be constructed automatically, comprising 565 tasks drawn from 239 authentic scientific repositories across four disciplines. Each task provides a natural-language instruction, an executable environment with the necessary dependencies, previews of the input datasets and expected artifacts, a reference code solution, and an automatically generated evaluation script. These evaluation scripts agree with human-annotated gold standards on 87.5% of tasks, suggesting they reliably capture domain-specific evaluation logic. By offering verifiable benchmarks, the dataset aims to advance language models and agents for scientific discovery.
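The per-task components described above can be pictured as a simple record. The sketch below is purely illustrative: the field names, paths, and class name are assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class D3GymTask:
    """Hypothetical record for one D3-Gym task (all field names are illustrative)."""
    instruction: str         # natural-language task description
    environment: str         # spec of the executable environment with dependencies
    dataset_preview: str     # preview of the input dataset
    artifact_preview: str    # preview of the expected output artifacts
    reference_solution: str  # path to the reference code solution
    eval_script: str         # path to the automatically generated evaluation script

# Example instance with invented placeholder values.
task = D3GymTask(
    instruction="Fit a model to the provided measurements and report RMSE.",
    environment="env/requirements.txt",
    dataset_preview="data/measurements.csv (first 5 rows)",
    artifact_preview="results/metrics.json (expected keys)",
    reference_solution="solution/run.py",
    eval_script="eval/check.py",
)
```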
Key facts
- D3-Gym is the first automatically constructed dataset with verifiable environments for scientific data-driven discovery.
- The dataset comprises 565 tasks from 239 real scientific repositories across four disciplines.
- Each task includes a natural-language instruction, an executable environment, previews of the input datasets and artifacts, a reference code solution, and an evaluation script.
- Evaluation scripts achieve 87.5% agreement with human-annotated gold standards.
- The dataset addresses the absence of verifiable environments for scientific tasks.
- It is designed to advance language models and agents in data-driven discovery.
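The 87.5% agreement figure can be read as the fraction of tasks on which the automatic evaluation script and the human-annotated gold standard reach the same verdict. A minimal sketch of that computation, with invented example verdicts (the source does not specify the verdict format):

```python
def agreement_rate(auto_verdicts, gold_verdicts):
    """Fraction of tasks where the automatic evaluation script's verdict
    matches the human-annotated gold verdict."""
    assert len(auto_verdicts) == len(gold_verdicts)
    matches = sum(a == g for a, g in zip(auto_verdicts, gold_verdicts))
    return matches / len(auto_verdicts)

# Illustrative verdicts only: 7 of 8 pairs match, i.e. 87.5% agreement.
auto = [True, True, False, True, False, True, True, True]
gold = [True, True, False, True, True, True, True, True]
print(agreement_rate(auto, gold))  # → 0.875
```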