ThermoQA Benchmark Tests LLM Thermodynamic Reasoning
A new benchmark called ThermoQA evaluates how well large language models reason about engineering thermodynamics. Its 293 open-ended questions are split into three tiers: 110 property lookups, 101 component-analysis problems, and 82 full cycle analyses. Ground-truth answers are computed with CoolProp 7.2.0, covering water, R-134a refrigerant, and variable-cp air. Six frontier LLMs were each evaluated over three runs; Claude Opus 4.6 led with 94.1% accuracy, followed by GPT-5.4 at 93.1% and Gemini 3.1 Pro at 92.5%. The accuracy drop from the lookup tier to the cycle-analysis tier shows that memorized property data alone is not enough for genuine thermodynamic reasoning. The dataset and evaluation code are available on Hugging Face.
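The summary does not spell out how answers are scored, but numeric benchmarks like this are typically graded by comparing each model's answer to the tool-derived ground truth within a relative tolerance. A minimal sketch of that idea (the function names, the 1% tolerance, and the sample values are illustrative assumptions, not taken from ThermoQA):

```python
# Sketch of relative-tolerance grading against precomputed ground truth.
# All names, the tolerance, and the sample values are illustrative assumptions.

def is_correct(predicted: float, truth: float, rel_tol: float = 0.01) -> bool:
    """Count an answer as correct if it falls within rel_tol of ground truth."""
    return abs(predicted - truth) <= rel_tol * abs(truth)

def tier_accuracy(pairs: list[tuple[float, float]]) -> float:
    """Accuracy (%) over (predicted, truth) pairs for one tier."""
    correct = sum(is_correct(p, t) for p, t in pairs)
    return 100.0 * correct / len(pairs)

# Hypothetical tier-1 lookups: (model answer, ground-truth value)
lookups = [
    (2778.1, 2778.2),  # within 1% tolerance -> correct
    (400.0, 419.1),    # off by ~4.5% -> incorrect
    (0.718, 0.718),    # exact match -> correct
]
print(tier_accuracy(lookups))  # 2 of 3 correct
```

In practice the ground-truth side of each pair would come from a property library call such as CoolProp's, rather than a hardcoded list.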
Key facts
- ThermoQA benchmark contains 293 open-ended thermodynamics problems
- Three tiers: property lookups (110 Q), component analysis (101 Q), full cycle analysis (82 Q)
- Ground truth computed from CoolProp 7.2.0
- Six frontier LLMs evaluated across three runs each
- Claude Opus 4.6 leads with 94.1% accuracy
- GPT-5.4 scores 93.1%, Gemini 3.1 Pro scores 92.5%
- Cross-tier accuracy degradation ranges from 2.8 percentage points (Claude Opus 4.6) to 32.5 (MiniMax)
- Supercritical water, R-134a, and combined-cycle gas turbine analysis are key discriminators
- Multi-run standard deviation ranges from ±0.1% to ±2.5%
- Dataset and code are open-source
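The cross-tier degradation figure above is just the difference, in percentage points, between a model's accuracy on the easiest tier and its accuracy on the hardest one. A small sketch of that metric, using made-up per-tier accuracies (the model names and numbers are hypothetical, not from the benchmark):

```python
# Cross-tier degradation: tier-1 (property lookup) accuracy minus
# tier-3 (full cycle analysis) accuracy, in percentage points.
# The per-tier accuracies below are hypothetical illustrations.

def degradation_pp(tier1_acc: float, tier3_acc: float) -> float:
    """Accuracy drop from property lookups to full cycle analysis."""
    return tier1_acc - tier3_acc

per_tier = {
    "model-A": (95.5, 92.7),  # small drop: reasoning holds up across tiers
    "model-B": (90.0, 57.5),  # large drop: strong lookups, weak cycle analysis
}
for name, (t1, t3) in per_tier.items():
    print(f"{name}: {degradation_pp(t1, t3):.1f} pp")
```

A narrow spread like model-A's is the pattern the benchmark rewards: performance that survives the shift from memorizable facts to multi-step analysis.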
Entities
Institutions
- arXiv
- Hugging Face
- CoolProp