ThermoQA Benchmark Tests LLM Thermodynamic Reasoning
A new benchmark called ThermoQA evaluates how well large language models reason about engineering thermodynamics. Its 293 open-ended questions are split into three tiers: 110 property lookups, 101 component-analysis problems, and 82 full cycle analyses. Ground-truth answers are computed with CoolProp 7.2.0, covering water, R-134a refrigerant, and variable-cp air. Six frontier LLMs were each evaluated over three runs; Claude Opus 4.6 led with 94.1% accuracy, followed by GPT-5.4 at 93.1% and Gemini 3.1 Pro at 92.5%. The accuracy drop from the lookup tier to the cycle-analysis tier shows that memorized property data alone is not enough for genuine thermodynamic reasoning. The dataset and evaluation code are available on Hugging Face.
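The summary does not spell out how answers are scored, but numeric benchmarks like this are typically graded by comparing each model's answer to the tool-derived ground truth within a relative tolerance. A minimal sketch of that idea (the function names, the 1% tolerance, and the sample values are illustrative assumptions, not taken from ThermoQA):

```python
# Sketch of relative-tolerance grading against precomputed ground truth.
# All names, the tolerance, and the sample values are illustrative assumptions.

def is_correct(predicted: float, truth: float, rel_tol: float = 0.01) -> bool:
    """Count an answer as correct if it falls within rel_tol of ground truth."""
    return abs(predicted - truth) <= rel_tol * abs(truth)

def tier_accuracy(pairs: list[tuple[float, float]]) -> float:
    """Accuracy (%) over (predicted, truth) pairs for one tier."""
    correct = sum(is_correct(p, t) for p, t in pairs)
    return 100.0 * correct / len(pairs)

# Hypothetical tier-1 lookups: (model answer, ground-truth value)
lookups = [
    (2778.1, 2778.2),  # within 1% tolerance -> correct
    (400.0, 419.1),    # off by ~4.5% -> incorrect
    (0.718, 0.718),    # exact match -> correct
]
print(tier_accuracy(lookups))  # 2 of 3 correct
```

In practice the ground-truth side of each pair would come from a property library call such as CoolProp's, rather than a hardcoded list.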
Key facts
- ThermoQA benchmark contains 293 open-ended thermodynamics problems
- Three tiers: property lookups (110 Q), component analysis (101 Q), full cycle analysis (82 Q)
- Ground truth computed from CoolProp 7.2.0
- Six frontier LLMs evaluated across three runs each
- Claude Opus 4.6 leads with 94.1% accuracy
- GPT-5.4 scores 93.1%, Gemini 3.1 Pro scores 92.5%
- Cross-tier accuracy degradation ranges from 2.8 percentage points (Claude Opus 4.6) to 32.5 (MiniMax)
- Supercritical water, R-134a, and combined-cycle gas turbine analysis are key discriminators
- Multi-run standard deviation ranges from ±0.1% to ±2.5%
- Dataset and code are open-source
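The cross-tier degradation figure above is just the difference, in percentage points, between a model's accuracy on the easiest tier and its accuracy on the hardest one. A small sketch of that metric, using made-up per-tier accuracies (the model names and numbers are hypothetical, not from the benchmark):

```python
# Cross-tier degradation: tier-1 (property lookup) accuracy minus
# tier-3 (full cycle analysis) accuracy, in percentage points.
# The per-tier accuracies below are hypothetical illustrations.

def degradation_pp(tier1_acc: float, tier3_acc: float) -> float:
    """Accuracy drop from property lookups to full cycle analysis."""
    return tier1_acc - tier3_acc

per_tier = {
    "model-A": (95.5, 92.7),  # small drop: reasoning holds up across tiers
    "model-B": (90.0, 57.5),  # large drop: strong lookups, weak cycle analysis
}
for name, (t1, t3) in per_tier.items():
    print(f"{name}: {degradation_pp(t1, t3):.1f} pp")
```

A narrow spread like model-A's is the pattern the benchmark rewards: performance that survives the shift from memorizable facts to multi-step analysis.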
Entities
Institutions
- arXiv
- Hugging Face
- CoolProp