New Benchmark TPS-CalcBench Evaluates LLM Analytical Calculation in Hypersonic Thermal Protection System Engineering

ai-technology · 2026-04-22

TPS-CalcBench is a new diagnostic benchmark that evaluates how well large language models perform analytical calculations in hypersonic thermal protection system (TPS) engineering. It targets a safety-critical setting in which an inaccurate estimate of stagnation-point heat flux or boundary-layer conditions could lead to catastrophic design failures. Whereas existing scientific benchmarks are described as testing only abstract mathematics and basic physics, TPS-CalcBench focuses on domain-oriented tasks that experienced TPS engineers perform without simulations. Its task taxonomy has four difficulty levels and eight categories covering hypersonic aerodynamics and high-temperature gas dynamics. The research argues that current evaluation methods are insufficient because they assess only final answers and ignore the engineering reasoning process, so a model can produce a numerically reasonable but physically invalid response, which is more dangerous than no response at all. It concludes that deploying LLMs as reasoning assistants in safety-critical aerospace applications requires stricter evaluation criteria than general scientific benchmarks provide. The work appears in arXiv preprint 2604.17966v1.
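
To make the kind of hand calculation concrete, the sketch below estimates convective stagnation-point heat flux with the Sutton-Graves correlation, a widely used engineering approximation. The summary does not say which correlations TPS-CalcBench actually uses, so the choice of formula, the function name stagnation_heat_flux, and the input values are assumptions for illustration only.

```python
import math

# Sutton-Graves constant for Earth's atmosphere in SI units (kg^0.5 / m).
K_EARTH = 1.7415e-4


def stagnation_heat_flux(rho, velocity, nose_radius, k=K_EARTH):
    """Estimate convective stagnation-point heat flux in W/m^2.

    rho:         freestream density in kg/m^3
    velocity:    freestream velocity in m/s
    nose_radius: effective nose radius in m
    """
    # Reject inputs that have no physical meaning instead of returning a number.
    if rho <= 0 or velocity <= 0 or nose_radius <= 0:
        raise ValueError("density, velocity and nose radius must be positive")
    return k * math.sqrt(rho / nose_radius) * velocity ** 3


# Illustrative (hypothetical) flight condition, not taken from the benchmark.
q = stagnation_heat_flux(rho=1e-4, velocity=7000.0, nose_radius=1.0)
print(f"{q:.3e} W/m^2")  # about 5.973e+05 W/m^2, i.e. roughly 60 W/cm^2
```

The explicit input check reflects the concern raised above: a result that looks numerically plausible is only useful if the quantities behind it are physically valid.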

Key facts

TPS-CalcBench is a diagnostic benchmark for evaluating LLM analytical calculation competence
Focuses on hypersonic thermal protection system engineering applications
Addresses safety-critical concerns where inaccurate calculations could cause catastrophic failures
Includes domain-oriented task taxonomy with 4 difficulty levels and 8 categories
Covers hypersonic aerodynamics and high-temperature gas dynamics
Targets calculations experienced TPS engineers conduct without simulations
Existing scientific benchmarks are described as testing only abstract math and basic physics
Models producing numerically reasonable but physically invalid answers are considered more dangerous than non-responsive models

Sources

arXiv cs.AI — 2026-04-21