MDGYM Benchmark Reveals AI Agents Struggle with Molecular Simulations
A new benchmark called MDGYM tests AI agents on molecular dynamics (MD) simulations and reveals poor performance across the board. The benchmark includes 169 expert-curated simulations spanning the LAMMPS and GROMACS packages at three difficulty levels. Among the evaluated agents (Claude Code, Codex, and OpenHands, each paired with four LLMs), the best performer solved only 21% of easy tasks, and success rates fell below 10% at the higher difficulty levels. The work highlights open challenges in AI-driven scientific discovery.
Key facts
- MDGYM is a benchmark of 169 expert-curated MD simulations.
- It spans the LAMMPS and GROMACS simulation packages.
- Three difficulty levels are included.
- Claude Code, Codex, and OpenHands were evaluated, each paired with four LLMs.
- The best-performing agent solved only 21% of easy tasks.
- Success rates fell below 10% at the higher difficulty levels.
Entities
Institutions
- arXiv