Stargazer: AI Agent Benchmark for Astrophysics
Stargazer serves as a scalable benchmark framework designed to assess AI agents on dynamic, iterative tasks that involve physics-based model fitting with radial-velocity time series data. It comprises 120 distinct tasks categorized into three levels of difficulty, featuring 20 authentic archival cases that range from high-SNR single-planet systems to intricate multi-planet configurations. An evaluation of eight leading agents highlighted a disparity between numerical optimization and compliance with physical constraints, indicating that while agents frequently attain satisfactory statistical fits, they often struggle to accurately retrieve the correct physical parameters.
Key facts
- Stargazer is a benchmark for AI agents on astrophysical model-fitting tasks.
- It uses radial-velocity time series data.
- Includes 120 tasks across three difficulty tiers.
- 20 tasks are real archival cases.
- Covers high-SNR single-planet to complex multi-planetary systems.
- Eight frontier agents were evaluated.
- Agents often achieve good statistical fit but fail to recover correct physical parameters.
Entities
—