MDGYM Benchmark Reveals AI Agents Struggle with Molecular Simulations
A new benchmark called MDGYM tests AI agents on molecular dynamics (MD) simulations and reveals poor performance across the board. The benchmark includes 169 expert-curated simulations spanning the LAMMPS and GROMACS packages at three difficulty levels. Among the evaluated agents (Claude Code, Codex, and OpenHands, each paired with four LLMs), the best performer solved only 21% of easy tasks, and success rates fell below 10% at the higher difficulty levels. The work highlights open challenges in AI-driven scientific discovery.
Key facts
- MDGYM is a benchmark of 169 expert-curated MD simulations.
- It spans the LAMMPS and GROMACS simulation packages.
- Three difficulty levels are included.
- Claude Code, Codex, and OpenHands were evaluated, each paired with four LLMs.
- The best-performing agent solved only 21% of easy tasks.
- Success rates fell below 10% at the higher difficulty levels.
Entities
Institutions
- arXiv