gwBenchmarks Tests LLM Agents on Gravitational Wave Modeling
A new benchmark suite called gwBenchmarks evaluates state-of-the-art LLM coding agents on high-precision gravitational wave astronomy tasks. The eight tasks are grounded in analytic calculations and numerical simulations that collectively represent over 10^8 core-hours of compute, and they span interpolation, regression, and high-dimensional time-series modeling. Success requires constructing models with relative error below 10^{-4} and reasoning about physical systems such as black hole orbital dynamics and merger remnant properties. The work highlights both the potential and the limitations of LLM agents in scientific modeling.
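To make the 10^{-4} accuracy target concrete, here is a minimal Python sketch of how such a threshold might be checked for an interpolation model. It is an illustration under stated assumptions, not the benchmark's actual scoring code: the summary does not say how relative error is defined, so the sketch assumes a relative L2 norm, and the names `relative_l2_error`, `linear_interpolate`, and `toy_error` are hypothetical, as is the sinusoid standing in for real waveform data.

```python
import math

TOLERANCE = 1e-4  # accuracy threshold quoted in the summary


def relative_l2_error(predicted, reference):
    """Return ||predicted - reference||_2 / ||reference||_2 (assumed metric)."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")
    num = math.sqrt(sum((p - r) ** 2 for p, r in zip(predicted, reference)))
    den = math.sqrt(sum(r ** 2 for r in reference))
    if den == 0.0:
        raise ValueError("reference has zero norm")
    return num / den


def linear_interpolate(xs, ys, x):
    """Piecewise-linear interpolation on a sorted grid xs."""
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return (1 - t) * ys[i] + t * ys[i + 1]
    raise ValueError("x lies outside the training grid")


def toy_error(n_train):
    """Fit a toy signal with n_train grid points and score it off-grid."""
    # Toy stand-in signal, not a real gravitational waveform.
    def signal(x):
        return math.sin(2 * math.pi * x)

    train_x = [i / (n_train - 1) for i in range(n_train)]
    train_y = [signal(x) for x in train_x]
    test_x = [(i + 0.5) / 200 for i in range(200)]  # held-out points
    predicted = [linear_interpolate(train_x, train_y, x) for x in test_x]
    reference = [signal(x) for x in test_x]
    return relative_l2_error(predicted, reference)


if __name__ == "__main__":
    for n in (11, 1001):
        err = toy_error(n)
        verdict = "passes" if err < TOLERANCE else "fails"
        print(f"n_train={n}: relative error {err:.2e} ({verdict})")
```

Even on this one-dimensional toy problem, a coarse grid misses the threshold by a wide margin and only a much denser grid meets it, which hints at why the same target is demanding for high-dimensional waveform models.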
Key facts
- gwBenchmarks is a suite of eight tasks for LLM coding agents.
- Tasks are based on gravitational wave analytic calculations and numerical simulations.
- The simulations represent over 10^8 core-hours of compute.
- Tasks include interpolation, regression, and high-dimensional time-series modeling.
- Models must achieve relative error less than 10^{-4}.
- Tasks involve black hole orbital dynamics and merger remnant properties.
- The benchmark tests end-to-end scientific modeling by LLMs.
- The paper is published on arXiv with ID 2605.11269.