ARTFEED — Contemporary Art Intelligence

LLMs Evaluated on Chemical Cost Reasoning with New ChemCost Benchmark

ai-technology · 2026-05-11

ChemCost is a new benchmark that tests how well large language models (LLMs) estimate the cost of procuring chemicals. It contains 1,427 evaluable reactions built on a fixed pricing snapshot that covers 2,261 chemicals and 230,775 supplier quotes. Given a reaction description, an agent must identify the chemical entities, retrieve supplier quotes, choose valid purchasable quantities, standardize amounts, and calculate the cost. The benchmark supports scalar scoring and stage-level analysis of grounding, retrieval, procurement, and arithmetic errors. It addresses a gap in the rigorous evaluation of LLMs in scientific settings, moving beyond curated demonstrations and LLM-as-judge evaluations to precise, unbiased ground truth.
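The procurement and arithmetic stages of the task can be illustrated with a minimal Python sketch. It is not the benchmark's code: the `Quote` fields, the `TO_MG` conversion table, the rule of buying whole packs from a single quote, and all prices are assumptions made for illustration.

```python
import math
from dataclasses import dataclass

# Hypothetical conversion table: all masses are standardized to milligrams.
TO_MG = {"mg": 1, "g": 1_000, "kg": 1_000_000}


@dataclass(frozen=True)
class Quote:
    supplier: str      # illustrative supplier label
    pack_size: float   # size of one purchasable pack
    unit: str          # unit of pack_size, a key of TO_MG
    price: float       # price of one pack, single assumed currency


def to_mg(amount: float, unit: str) -> float:
    """Standardize an amount to milligrams."""
    return amount * TO_MG[unit]


def cheapest_purchase(required_mg: float, quotes: list[Quote]) -> float:
    """Cost of the cheapest single quote that covers the required mass.

    Assumption: only whole packs can be bought, so the pack count is
    rounded up.
    """
    if not quotes:
        raise ValueError("no quotes available for this chemical")
    costs = []
    for q in quotes:
        packs = math.ceil(required_mg / to_mg(q.pack_size, q.unit))
        costs.append(packs * q.price)
    return min(costs)


def reaction_cost(requirements: dict, catalog: dict) -> float:
    """Sum the cheapest purchase cost over every chemical in a reaction."""
    return sum(
        cheapest_purchase(to_mg(amount, unit), catalog[name])
        for name, (amount, unit) in requirements.items()
    )


# Made-up example data: two chemicals, two quotes each.
catalog = {
    "reagent_a": [Quote("s1", 1, "g", 10.0), Quote("s2", 5, "g", 25.0)],
    "reagent_b": [Quote("s1", 1, "g", 8.0), Quote("s3", 250, "mg", 3.0)],
}
requirements = {"reagent_a": (2.5, "g"), "reagent_b": (500, "mg")}
total = reaction_cost(requirements, catalog)
```

In this made-up example, the 5 g pack of `reagent_a` (25.0) beats three 1 g packs (30.0), and two 250 mg packs of `reagent_b` (6.0) beat one 1 g pack (8.0). A mistake in unit conversion or pack rounding changes the final number, which is why the cost can only be right if each preceding step is right.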

Key facts

  • ChemCost benchmark includes 1,427 evaluable reactions
  • Pricing snapshot covers 2,261 chemicals and 230,775 supplier quotes
  • Task involves grounding, retrieval, procurement, and arithmetic steps
  • Evaluation uses exact ground truth rather than LLM-as-judge
  • Published as arXiv:2605.07251
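Because each reaction has an exact reference cost, a prediction can be scored numerically without a judging model. The Python sketch below shows one way such a scalar score could be computed. Relative error and the 5% tolerance are assumptions chosen for illustration, not the metric defined by ChemCost.

```python
def relative_error(predicted: float, truth: float) -> float:
    """Absolute error as a fraction of the ground-truth cost."""
    if truth <= 0:
        raise ValueError("ground-truth cost must be positive")
    return abs(predicted - truth) / truth


def within_tolerance(predicted: float, truth: float, tol: float = 0.05) -> bool:
    """True if the prediction is within `tol` (hypothetical 5%) of the truth."""
    return relative_error(predicted, truth) <= tol
```

A score of this kind is deterministic: the same prediction always receives the same score against the same pricing snapshot.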

Entities

Institutions

  • arXiv

Sources