AstroAlertBench: Benchmarking Multimodal LLMs for Astronomical Classification
Researchers have introduced AstroAlertBench, a multimodal benchmark that assesses how well large language models (LLMs) classify astronomical events. The benchmark draws on 1,500 real alerts from the Zwicky Transient Facility (ZTF), a wide-field survey that scans the northern sky for transient phenomena. It evaluates models through a three-step process: grounding in alert metadata, scientific reasoning, and hierarchical classification into five categories. Thirteen frontier LLMs capable of processing visual input, both closed-source and open-weight, were tested. The findings indicate that even the most capable models struggle with specialized scientific classification, which remains a significant obstacle to automating the review of astronomical alerts.
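As a rough illustration of how a three-step output with a hierarchical label might be scored, the following Python sketch may help. It is not the benchmark's actual harness: the `StageOutput` record, the `hierarchical_accuracy` function, the two-level label path, and the placeholder label names are all assumptions made for this example, and the benchmark's real five categories are not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical label path: (coarse level, fine level).
# Placeholder names only, not the benchmark's actual categories.
LabelPath = Tuple[str, str]


@dataclass
class StageOutput:
    """One model response, split into the three evaluated steps (assumed layout)."""
    metadata_summary: str  # step 1: what the model read from the alert metadata
    reasoning: str         # step 2: free-text scientific reasoning
    label: LabelPath       # step 3: predicted hierarchical class


def hierarchical_accuracy(
    predictions: List[StageOutput], truths: List[LabelPath]
) -> Tuple[float, float]:
    """Return (coarse_accuracy, exact_accuracy) for paired predictions and truths.

    Coarse accuracy counts a match on the top level only; exact accuracy
    requires the full label path to match.
    """
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    if not truths:
        raise ValueError("at least one example is required")
    coarse = sum(p.label[0] == t[0] for p, t in zip(predictions, truths))
    exact = sum(p.label == t for p, t in zip(predictions, truths))
    n = len(truths)
    return coarse / n, exact / n


# Toy data with placeholder labels.
preds = [
    StageOutput("mag rising", "consistent with class A1", ("coarse_A", "fine_A1")),
    StageOutput("flat light curve", "likely class A2", ("coarse_A", "fine_A2")),
    StageOutput("no host visible", "likely class B1", ("coarse_B", "fine_B1")),
]
truths = [("coarse_A", "fine_A1"), ("coarse_A", "fine_A1"), ("coarse_A", "fine_A2")]
coarse_acc, exact_acc = hierarchical_accuracy(preds, truths)
```

Reporting the coarse and exact scores separately makes it visible when a model picks the right broad class but the wrong fine-grained one, which a single accuracy number would hide.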
Key facts
- AstroAlertBench is a multimodal benchmark for LLMs in astronomical classification.
- It uses 1,500 real alerts from the Zwicky Transient Facility (ZTF).
- The benchmark evaluates metadata grounding, scientific reasoning, and hierarchical classification into five categories.
- Thirteen frontier LLMs (closed-source and open-weight) were tested.
- Results show that even the most capable LLMs struggle with specialized scientific classification.
Entities
Institutions
- Zwicky Transient Facility (ZTF)