ARTFEED — Contemporary Art Intelligence

MDGYM Benchmark Reveals AI Agents Struggle with Molecular Simulations

other · 2026-05-12

A new benchmark called MDGYM tests AI agents on molecular dynamics (MD) simulations and finds that they perform poorly. The benchmark comprises 169 expert-curated simulations spanning the LAMMPS and GROMACS packages at three difficulty levels. Among the evaluated agents—Claude Code, Codex, and OpenHands paired with four LLMs—the best solved only 21% of easy tasks and fewer than 10% at higher difficulties. The results highlight how far AI agents remain from reliable AI-driven scientific discovery.

Key facts

  • MDGYM is a benchmark of 169 expert-curated MD simulations.
  • It spans LAMMPS and GROMACS packages.
  • Three difficulty levels are included.
  • Claude Code, Codex, and OpenHands were evaluated.
  • Four LLMs were used in the evaluation.
  • The best-performing agent solved only 21% of easy tasks.
  • Agents solved fewer than 10% of tasks at higher difficulties.

Entities

Institutions

  • arXiv
