LLMs Evaluated on Chemical Cost Reasoning with New ChemCost Benchmark
ChemCost is a new benchmark that evaluates large language models (LLMs) on estimating chemical procurement costs. It comprises 1,427 evaluable reactions built on a fixed pricing snapshot covering 2,261 chemicals and 230,775 supplier quotes. Given a reaction description, an agent must identify the chemical entities, retrieve supplier quotes, select valid purchasable quantities, normalize amounts, and compute the total cost. Because the ground truth is exact rather than judged, the benchmark supports scalar scoring and stage-level analysis of grounding, retrieval, procurement, and arithmetic errors, moving beyond the curated demonstrations and LLM-as-judge evaluations that have dominated assessments of LLMs in scientific workflows.
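The procurement-and-arithmetic stages described above can be sketched as a toy calculation; the quote format, function name, and single-quote purchasing rule below are illustrative assumptions, not ChemCost's actual schema or scoring logic.

```python
from math import ceil

def cheapest_purchase(needed_g: float, quotes: list[tuple[float, float]]) -> float:
    """Return the lowest cost (USD) of covering `needed_g` grams of a chemical
    by buying whole packs from a single supplier quote.

    Each quote is a hypothetical (pack_size_g, price_usd) pair; real supplier
    catalogs are more complex (units, purity grades, shipping, etc.).
    """
    best = float("inf")
    for pack_size_g, price_usd in quotes:
        packs = ceil(needed_g / pack_size_g)  # must buy whole packs
        best = min(best, packs * price_usd)
    return best

# Example: a reaction needs 25 g; quotes offer 10 g @ $4 and 100 g @ $12.
# Three 10 g packs cost $12; one 100 g pack also costs $12.
print(cheapest_purchase(25, [(10.0, 4.0), (100.0, 12.0)]))  # 12.0
```

An agent would run a calculation like this per reagent after grounding the entity and normalizing amounts to a common unit, then sum across reagents; errors at any stage (wrong entity, wrong quote, wrong pack choice, wrong arithmetic) propagate to the final scalar cost.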
Key facts
- ChemCost benchmark includes 1,427 evaluable reactions
- Pricing snapshot covers 2,261 chemicals and 230,775 supplier quotes
- Task involves grounding, retrieval, procurement, and arithmetic steps
- Evaluation uses exact ground truth rather than LLM-as-judge
- Published as arXiv:2605.07251
Entities
Institutions
- arXiv