AstroAlertBench: Benchmarking Multimodal LLMs for Astronomical Classification

ai-technology · 2026-05-09

Researchers have launched AstroAlertBench, a multimodal benchmark for assessing how well large language models (LLMs) classify astronomical events. The benchmark uses 1,500 real alerts from the Zwicky Transient Facility (ZTF), which conducts a wide-field survey of the northern sky for transient phenomena. It evaluates models in three stages: metadata grounding, scientific reasoning, and hierarchical classification into five categories. Thirteen frontier LLMs with visual input support, both closed-source and open-weight, were tested. The findings indicate that even the most capable models struggle with specialized scientific classification, a significant obstacle to automating the review of astronomical alerts.

Key facts

AstroAlertBench is a multimodal benchmark for LLMs in astronomical classification.
It uses 1,500 real alerts from the Zwicky Transient Facility (ZTF).
The benchmark evaluates metadata grounding, scientific reasoning, and hierarchical classification.
Thirteen frontier LLMs (closed-source and open-weight) were tested.
Results show that even the most capable LLMs struggle with specialized scientific classification.
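
To make the three evaluation stages concrete, here is a minimal Python sketch of how per-alert results might be recorded and aggregated. It is illustrative only and not the benchmark's actual scoring code: the `StageResult` fields, the prefix-match partial-credit rule in `hierarchical_credit`, and the placeholder label names are all assumptions, since the source does not name the five categories or describe the metrics.

```python
from dataclasses import dataclass


@dataclass
class StageResult:
    """One model response to one alert (hypothetical record layout)."""
    grounded: bool      # stage 1: alert metadata read correctly (assumed pass/fail)
    reasoning_ok: bool  # stage 2: scientific reasoning judged sound (assumed pass/fail)
    predicted: tuple    # stage 3: predicted label path, coarse to fine
    gold: tuple         # reference label path, coarse to fine


def hierarchical_credit(predicted, gold):
    """Fraction of the gold path matched by the prediction, counted from the root.

    This prefix-match rule is an assumed scoring scheme for illustration.
    """
    matched = 0
    for p, g in zip(predicted, gold):
        if p != g:
            break  # a wrong coarse label gives no credit for finer levels
        matched += 1
    return matched / len(gold)


def summarize(results):
    """Average each stage's score over all evaluated alerts."""
    n = len(results)
    return {
        "grounding_acc": sum(r.grounded for r in results) / n,
        "reasoning_acc": sum(r.reasoning_ok for r in results) / n,
        "hier_credit": sum(hierarchical_credit(r.predicted, r.gold) for r in results) / n,
    }


# Placeholder labels only; the real category names are not given in the source.
results = [
    StageResult(True, True, ("coarse_1", "fine_a"), ("coarse_1", "fine_a")),
    StageResult(True, False, ("coarse_1", "fine_b"), ("coarse_1", "fine_a")),
]
summary = summarize(results)
```

Scoring the stages separately, as in this sketch, would let a reviewer see whether a wrong final label stems from misread metadata, flawed reasoning, or only the last fine-grained classification step.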
