ARTFEED — Contemporary Art Intelligence

AutoMat Benchmark Tests LLMs on Reproducing Materials Science Claims

ai-technology · 2026-05-04

A new benchmark called AutoMat evaluates whether coding agents built on large language models (LLMs) can reproduce findings from computational materials science. Introduced in arXiv:2605.00803, the benchmark poses three challenges: recovering underspecified computational procedures, navigating specialized toolchains, and determining whether the resulting evidence supports a scientific claim. Claims are curated from real materials science papers with input from subject matter experts. The work tests whether LLM success on software engineering benchmarks transfers to complex scientific workflows that demand domain-specific procedures and careful interpretation of results.
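The article gives no detail on how AutoMat represents its tasks, so the following is a minimal sketch, assuming each benchmark item pairs a claim from a real paper with an expert-assigned verdict and is scored by exact verdict match. The names Verdict, ClaimTask, and score_verdict, the label set, and every field are hypothetical illustrations, not AutoMat's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    # Hypothetical label set; the article does not specify
    # which verdict categories AutoMat actually uses.
    SUPPORTED = "supported"
    NOT_SUPPORTED = "not_supported"
    INCONCLUSIVE = "inconclusive"


@dataclass
class ClaimTask:
    """One benchmark item: a claim drawn from a real paper (illustrative schema)."""
    claim_id: str
    claim_text: str          # e.g. "compound X is dynamically stable at 0 K"
    source_paper: str        # identifier of the originating paper
    expert_verdict: Verdict  # ground-truth label assigned with SME input
    procedure_hints: list[str] = field(default_factory=list)  # details the paper leaves underspecified


def score_verdict(task: ClaimTask, agent_verdict: Verdict) -> bool:
    """Exact-match scoring against the expert label (one plausible metric)."""
    return agent_verdict == task.expert_verdict
```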

Key facts

  • AutoMat is a benchmark for evaluating LLM-based coding agents on claim reproduction in computational materials science (arXiv:2605.00803).
  • It poses three challenges: recovering underspecified procedures, navigating specialized toolchains, and determining whether evidence supports a claim (a minimal evaluation-loop sketch follows this list).
  • Claims are curated from real materials science papers with subject matter expert input.
  • The study addresses whether LLM success on software engineering benchmarks transfers to scientific workflows.
  • Success requires both coding ability and domain-specific knowledge, reflecting the growing deployment of LLMs as autonomous coding agents.
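The article also does not describe AutoMat's evaluation interface, so the sketch below, continuing the hypothetical ClaimTask/Verdict schema above, shows one plausible way a harness could drive an agent over such tasks. The run_agent callable is a stand-in for whatever agent reconstructs the procedure and executes the toolchain; it is not part of any published API.

```python
from typing import Callable


def evaluate(tasks: list[ClaimTask],
             run_agent: Callable[[ClaimTask], Verdict]) -> float:
    """Run a (hypothetical) agent over every task and report verdict accuracy.

    `run_agent` is assumed to reconstruct the computational procedure,
    drive the relevant toolchain, and return a Verdict for the claim.
    """
    if not tasks:
        return 0.0
    correct = sum(score_verdict(t, run_agent(t)) for t in tasks)
    return correct / len(tasks)
```

Per-claim exact match is the simplest choice here; a real harness might instead report partial credit for correctly reconstructed procedures even when the final verdict is wrong.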

Entities

Institutions

  • arXiv

Sources

  • arXiv:2605.00803