LLMs Evaluated on Chemical Cost Reasoning with New ChemCost Benchmark
ChemCost is a new benchmark that evaluates large language models (LLMs) on estimating chemical procurement costs. It comprises 1,427 evaluable reactions built on a fixed pricing snapshot covering 2,261 chemicals and 230,775 supplier quotes. Given a reaction description, an agent must identify the chemical entities, retrieve supplier quotes, select valid purchasable quantities, normalize amounts, and compute the total cost. Because the ground truth is exact rather than judged, the benchmark supports scalar scoring and stage-level analysis of grounding, retrieval, procurement, and arithmetic errors, moving beyond the curated demonstrations and LLM-as-judge evaluations that have dominated assessments of LLMs in scientific workflows.
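The procurement-and-arithmetic stages described above can be sketched as a toy calculation; the quote format, function name, and single-quote purchasing rule below are illustrative assumptions, not ChemCost's actual schema or scoring logic.

```python
from math import ceil

def cheapest_purchase(needed_g: float, quotes: list[tuple[float, float]]) -> float:
    """Return the lowest cost (USD) of covering `needed_g` grams of a chemical
    by buying whole packs from a single supplier quote.

    Each quote is a hypothetical (pack_size_g, price_usd) pair; real supplier
    catalogs are more complex (units, purity grades, shipping, etc.).
    """
    best = float("inf")
    for pack_size_g, price_usd in quotes:
        packs = ceil(needed_g / pack_size_g)  # must buy whole packs
        best = min(best, packs * price_usd)
    return best

# Example: a reaction needs 25 g; quotes offer 10 g @ $4 and 100 g @ $12.
# Three 10 g packs cost $12; one 100 g pack also costs $12.
print(cheapest_purchase(25, [(10.0, 4.0), (100.0, 12.0)]))  # 12.0
```

An agent would run a calculation like this per reagent after grounding the entity and normalizing amounts to a common unit, then sum across reagents; errors at any stage (wrong entity, wrong quote, wrong pack choice, wrong arithmetic) propagate to the final scalar cost.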
Key facts
- ChemCost benchmark includes 1,427 evaluable reactions
- Pricing snapshot covers 2,261 chemicals and 230,775 supplier quotes
- Task involves grounding, retrieval, procurement, and arithmetic steps
- Evaluation uses exact ground truth rather than LLM-as-judge
- Published as arXiv:2605.07251
Entities
Institutions
- arXiv