ARTFEED — Contemporary Art Intelligence

ThermoQA Benchmark Tests LLM Thermodynamic Reasoning

ai-technology · 2026-04-24

A new benchmark, ThermoQA, evaluates how well large language models handle engineering thermodynamics. It comprises 293 open-ended questions across three tiers: 110 property lookups, 101 component-analysis problems, and 82 full cycle analyses. Ground-truth answers are computed with CoolProp 7.2.0 and cover water, R-134a refrigerant, and variable-cp air. Six frontier LLMs were each evaluated over three runs; Claude Opus 4.6 led at 94.1%, followed by GPT-5.4 at 93.1% and Gemini 3.1 Pro at 92.5%. Accuracy drops from the lookup tier to the cycle tier, suggesting that memorized property data alone is not enough for genuine thermodynamic reasoning. The dataset and evaluation code are available on Hugging Face.
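The brief doesn't spell out ThermoQA's scoring rule, but since answers are graded against CoolProp-derived numeric references, a plausible minimal sketch is a relative-tolerance check. The function name and the 1% tolerance below are assumptions for illustration, not the benchmark's actual settings:

```python
# Hypothetical grading sketch: the actual ThermoQA scoring rule is not
# described in this brief. Assumes a numeric answer counts as correct
# when it falls within a relative tolerance of the reference value
# (e.g., one computed by CoolProp 7.2.0).

def is_correct(answer: float, reference: float, rel_tol: float = 0.01) -> bool:
    """Return True if answer is within rel_tol of the reference value."""
    if reference == 0.0:
        return abs(answer) <= rel_tol
    return abs(answer - reference) / abs(reference) <= rel_tol

# Example: a saturated-steam enthalpy question with a reference of 2675.6 kJ/kg
print(is_correct(2676.0, 2675.6))  # within 1% of reference -> True
print(is_correct(2800.0, 2675.6))  # roughly 4.6% off -> False
```

A relative (rather than absolute) tolerance keeps the rule meaningful across quantities of very different magnitude, e.g. enthalpies in kJ/kg versus cycle efficiencies near 0.4.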

Key facts

  • ThermoQA benchmark contains 293 open-ended thermodynamics problems
  • Three tiers: property lookups (110 Q), component analysis (101 Q), full cycle analysis (82 Q)
  • Ground truth computed from CoolProp 7.2.0
  • Six frontier LLMs evaluated across three runs each
  • Claude Opus 4.6 leads with 94.1% accuracy
  • GPT-5.4 scores 93.1%, Gemini 3.1 Pro scores 92.5%
  • Cross-tier degradation ranges from 2.8 pp (Opus) to 32.5 pp (MiniMax)
  • Supercritical water, R-134a, and combined-cycle gas turbine analysis are key discriminators
  • Multi-run sigma ranges from ±0.1% to ±2.5%
  • Dataset and code are open-source
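The cross-tier degradation and multi-run sigma figures above are straightforward to derive from per-run, per-tier accuracies. A short sketch with made-up run scores (the real per-run numbers are not given in this brief):

```python
# Hypothetical per-run accuracies for one model, used only to illustrate
# how the reported figures would be computed. pstdev gives the population
# standard deviation across the three runs (the "multi-run sigma").
from statistics import mean, pstdev

runs_tier1 = [95.5, 94.5, 95.0]  # property lookups, % accuracy per run
runs_tier3 = [92.5, 92.0, 92.0]  # full cycle analysis, % accuracy per run

sigma_tier1 = pstdev(runs_tier1)                    # multi-run sigma, tier 1
degradation = mean(runs_tier1) - mean(runs_tier3)   # pp drop, tier 1 -> tier 3
print(f"sigma = +/-{sigma_tier1:.2f} pp, degradation = {degradation:.1f} pp")
```

Degradation is reported in percentage points (pp), i.e. a simple difference of tier accuracies, not a relative change.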

Entities

Institutions

  • arXiv
  • Hugging Face
  • CoolProp

Sources