gwBenchmarks Tests LLM Agents on Gravitational Wave Modeling

ai-technology · 2026-05-13

A new benchmark suite, gwBenchmarks, evaluates state-of-the-art LLM coding agents on high-precision gravitational-wave astronomy tasks. Its eight tasks are grounded in analytic calculations and numerical simulations that collectively represent more than 10^8 core-hours of compute, and they span interpolation, regression, and high-dimensional time-series modeling. To succeed, an agent must construct models with relative error below 10^{-4} and reason about physical systems such as black hole orbital dynamics and merger remnant properties. The work highlights both the potential and the limitations of LLM agents in end-to-end scientific modeling.
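To give a sense of what a 10^{-4} relative-error requirement means in practice, here is a minimal sketch of such a check. The summary does not say how gwBenchmarks defines or aggregates relative error, so the maximum pointwise form used here, the function names, and the toy data are all illustrative assumptions rather than the benchmark's actual metric.

```python
import math

# Threshold quoted in the article; how the benchmark applies it is not stated.
TOLERANCE = 1e-4


def max_relative_error(predicted, reference):
    """Largest pointwise relative error |p - r| / |r| over paired samples.

    Assumption: a max-over-samples metric. The benchmark may instead use
    a norm-based or task-specific definition.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    worst = 0.0
    for p, r in zip(predicted, reference):
        if r == 0:
            raise ValueError("reference value is zero; relative error is undefined")
        worst = max(worst, abs(p - r) / abs(r))
    return worst


def meets_tolerance(predicted, reference, tol=TOLERANCE):
    """Return True if the worst-case relative error is below tol."""
    return max_relative_error(predicted, reference) < tol


# Toy reference curve (hypothetical data, bounded away from zero).
reference = [2.0 + math.sin(0.1 * i) for i in range(100)]

# Two hypothetical models: one off by 5e-5 relative, one off by 5e-4.
good_model = [r * (1 + 5e-5) for r in reference]
bad_model = [r * (1 + 5e-4) for r in reference]

print(meets_tolerance(good_model, reference))
print(meets_tolerance(bad_model, reference))
```

The first model stays inside the tolerance and the second does not, which shows how little room the threshold leaves: a systematic error of a few parts in ten thousand is already a failure.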

Key facts

gwBenchmarks is a suite of eight tasks for LLM coding agents.
Tasks are grounded in analytic calculations and numerical simulations of gravitational waves.
The simulations represent over 10^8 core-hours of compute.
Tasks include interpolation, regression, and high-dimensional time-series modeling.
Models must achieve relative error less than 10^{-4}.
Tasks involve black hole orbital dynamics and merger remnant properties.
The benchmark tests end-to-end scientific modeling by LLMs.
The paper is available on arXiv as arXiv:2605.11269.

Sources

arXiv cs.AI — 2026-05-13