Stargazer: AI Agent Benchmark for Astrophysics

other · 2026-05-13

Stargazer is a scalable benchmark for evaluating AI agents on dynamic, iterative tasks that require physics-based model fitting of radial-velocity (RV) time series. It comprises 120 tasks across three difficulty tiers, including 20 real archival cases that range from high-SNR single-planet systems to complex multi-planet systems. An evaluation of eight frontier agents revealed a gap between numerical optimization and adherence to physical constraints: agents often achieve a good statistical fit yet fail to recover the correct physical parameters.

Key facts

Stargazer is a benchmark for AI agents on astrophysical model-fitting tasks.
It uses radial-velocity time series data.
It includes 120 tasks across three difficulty tiers.
Of these, 20 are real archival cases.
Coverage ranges from high-SNR single-planet systems to complex multi-planet systems.
Eight frontier agents were evaluated.
Agents often achieve good statistical fit but fail to recover correct physical parameters.
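
The gap between statistical fit and parameter recovery can be illustrated with a toy case. The sketch below is not taken from Stargazer; it assumes a hypothetical single planet on a circular orbit, one observation per night, and made-up values for the period, amplitude, and noise. With strictly nightly sampling, a signal at frequency f and one at frequency 1 - f take the same values at every observation time, so a wrong period can match the data exactly as well as the true one.

```python
import math
import random

def rv_circular(t, period, amplitude, phase):
    """Radial velocity (m/s) at time t (days) for a circular orbit."""
    return amplitude * math.sin(2.0 * math.pi * t / period + phase)

def chi_square(times, observed, sigma, period, amplitude, phase):
    """Sum of squared residuals, each scaled by the measurement uncertainty."""
    return sum(
        ((v - rv_circular(t, period, amplitude, phase)) / sigma) ** 2
        for t, v in zip(times, observed)
    )

# Hypothetical setup (all values are assumptions for illustration):
# 40 nightly observations with Gaussian noise of 1 m/s.
rng = random.Random(0)
times = [float(n) for n in range(40)]
true_period, true_amplitude, true_phase = 4.0, 10.0, 0.3
sigma = 1.0
observed = [
    rv_circular(t, true_period, true_amplitude, true_phase) + rng.gauss(0.0, sigma)
    for t in times
]

# One-day alias: frequency 1 - 1/P with the phase reflected about pi/2.
# At integer t, sin(2*pi*(1 - f)*t + pi - phi) equals sin(2*pi*f*t + phi).
alias_period = 1.0 / (1.0 - 1.0 / true_period)
alias_phase = math.pi - true_phase

chi_true = chi_square(times, observed, sigma, true_period, true_amplitude, true_phase)
chi_alias = chi_square(times, observed, sigma, alias_period, true_amplitude, alias_phase)

print(f"true period  {true_period:.3f} d  chi2 = {chi_true:.3f}")
print(f"alias period {alias_period:.3f} d  chi2 = {chi_alias:.3f}")
```

Both parameter sets yield the same chi-square up to floating-point error, although the alias period differs from the true one by a factor of three. A score based only on goodness of fit could not tell these two solutions apart, which is one simple way a good statistical fit and correct physical parameters can diverge.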

Sources

arXiv cs.AI — 2026-05-13