New Benchmark TPS-CalcBench Evaluates LLM Analytical Calculation in Hypersonic Thermal Protection System Engineering
TPS-CalcBench is a new diagnostic benchmark that evaluates how well large language models perform analytical calculations in hypersonic thermal protection system (TPS) engineering. It targets a safety-critical setting: an inaccurate estimate of stagnation-point heat flux or boundary-layer conditions could lead to catastrophic design failures. According to the authors, existing scientific benchmarks test only abstract mathematics and basic physics, whereas TPS-CalcBench focuses on domain-oriented tasks of the kind experienced TPS engineers carry out without simulations. The benchmark organizes its tasks in a taxonomy with four difficulty levels and eight categories covering hypersonic aerodynamics and high-temperature gas dynamics. The authors argue that current evaluation methods are insufficient because they score only the final answer and ignore the engineering reasoning behind it. As a result, a model can produce a numerically reasonable but physically invalid response, which the authors consider more dangerous than no response at all. The research concludes that deploying LLMs as reasoning assistants in safety-critical aerospace applications requires stricter evaluation criteria than general scientific benchmarks provide. The work appears as arXiv preprint 2604.17966v1.
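To make "analytical calculation without simulations" concrete, here is a minimal sketch of one closed-form stagnation-point heat flux estimate. It assumes the Sutton-Graves correlation, q = k * sqrt(rho / R_n) * V^3, with the commonly quoted Earth-air constant in SI units. The summary does not say which correlations TPS-CalcBench uses, so the correlation choice, the function name, and the flight condition below are all illustrative assumptions.

```python
import math

# Sutton-Graves constant for Earth air, SI units (kg^0.5 / m).
# Assumption: picked for illustration; the source names no correlation.
K_EARTH = 1.7415e-4


def stagnation_heat_flux(rho, velocity, nose_radius, k=K_EARTH):
    """Convective stagnation-point heat flux in W/m^2.

    rho          freestream density, kg/m^3
    velocity     freestream velocity, m/s
    nose_radius  effective nose radius, m
    """
    # Reject non-physical inputs instead of returning a plausible-looking number.
    if rho <= 0 or velocity <= 0 or nose_radius <= 0:
        raise ValueError("density, velocity and nose radius must be positive")
    return k * math.sqrt(rho / nose_radius) * velocity ** 3


# Hypothetical flight condition, not taken from the benchmark.
q = stagnation_heat_flux(rho=3.0e-4, velocity=7000.0, nose_radius=1.0)
print(f"{q / 1e4:.1f} W/cm^2")
```

Under these assumed inputs the result is on the order of 100 W/cm^2. The cubic dependence on velocity and the inverse square-root dependence on nose radius are scaling relations an engineer would expect a model's reasoning to respect, not just its final number.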
Key facts
- TPS-CalcBench is a diagnostic benchmark for evaluating LLM analytical calculation competence
- Focuses on hypersonic thermal protection system engineering applications
- Addresses safety-critical concerns where inaccurate calculations could cause catastrophic failures
- Includes a domain-oriented task taxonomy with 4 difficulty levels and 8 categories
- Covers hypersonic aerodynamics and high-temperature gas dynamics
- Targets calculations experienced TPS engineers conduct without simulations
- Existing scientific benchmarks are described as testing only abstract math and basic physics
- Models producing numerically reasonable but physically invalid answers are considered more dangerous than non-responsive models
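The last point suggests a grading rule that looks beyond the final number. The sketch below is one hypothetical way to encode it, not the paper's scoring method. The labels, the `SCORES` values, the 5% tolerance, and the `physically_valid` flag (assumed to come from a separate check of the model's reasoning) are all assumptions made for illustration.

```python
from typing import Optional

# Hypothetical scores: a physically invalid answer is penalized more
# heavily than an abstention, mirroring the ranking stated in the summary.
SCORES = {
    "correct": 1.0,
    "abstained": 0.0,
    "wrong_value": -0.5,
    "invalid_physics": -1.0,
}


def grade(answer: Optional[float], reference: float,
          physically_valid: bool, rel_tol: float = 0.05) -> str:
    """Label one response; `answer` is None when the model declines to answer."""
    if answer is None:
        return "abstained"
    # Checked before the numeric comparison: a close number reached
    # through invalid physics must not count as correct.
    if not physically_valid:
        return "invalid_physics"
    if abs(answer - reference) <= rel_tol * abs(reference):
        return "correct"
    return "wrong_value"


# A numerically close answer built on invalid reasoning scores below an abstention.
label = grade(1.02e6, reference=1.0e6, physically_valid=False)
print(label, SCORES[label])
```

Checking validity before the tolerance comparison is the key design choice here: an answer-only grader would mark the same response as correct.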