ARTFEED — Contemporary Art Intelligence

AstroAlertBench: Benchmarking Multimodal LLMs for Astronomical Classification

ai-technology · 2026-05-09

Researchers have introduced AstroAlertBench, a multimodal benchmark for assessing how well large language models (LLMs) classify astronomical events. It draws on 1,500 real alerts from the Zwicky Transient Facility (ZTF), a wide-field survey that scans the northern sky for transient phenomena. Each model is evaluated through a three-step process: grounding in the alert metadata, scientific reasoning, and hierarchical classification into five categories. Thirteen frontier LLMs with visual input, both closed-source and open-weight, were tested. The findings indicate that even the most capable models struggle with specialized scientific classification, which remains a significant obstacle to automating the review of astronomical alerts.
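To make the three-step evaluation concrete, here is a minimal Python sketch of how per-step scores might be tallied. It is not the benchmark's actual code or schema: the `AlertResult` fields, the `stage_accuracies` helper, and the placeholder category labels are all assumptions made for illustration, the hierarchy is flattened to a single label per alert, and the sample records are invented rather than taken from the reported results.

```python
from dataclasses import dataclass

# Hypothetical placeholder labels; the article does not name the five categories.
CATEGORIES = ("category_a", "category_b", "category_c", "category_d", "category_e")


@dataclass(frozen=True)
class AlertResult:
    """Outcome of one model on one alert (hypothetical schema)."""
    metadata_grounded: bool  # step 1: alert metadata read correctly
    reasoning_valid: bool    # step 2: scientific reasoning judged sound
    predicted: str           # step 3: category chosen by the model
    truth: str               # reference category for the alert


def stage_accuracies(results):
    """Return the fraction of alerts that pass each of the three steps."""
    if not results:
        raise ValueError("no results to score")
    for r in results:
        if r.predicted not in CATEGORIES or r.truth not in CATEGORIES:
            raise ValueError(f"unknown category in {r!r}")
    n = len(results)
    return {
        "metadata_grounding": sum(r.metadata_grounded for r in results) / n,
        "scientific_reasoning": sum(r.reasoning_valid for r in results) / n,
        "classification": sum(r.predicted == r.truth for r in results) / n,
    }


# Invented sample records, for illustration only.
results = [
    AlertResult(True, True, "category_a", "category_a"),
    AlertResult(True, False, "category_b", "category_c"),
    AlertResult(False, False, "category_e", "category_e"),
    AlertResult(True, True, "category_d", "category_d"),
]
scores = stage_accuracies(results)
print(scores)
```

Scoring each step separately, rather than reporting only the final label accuracy, would show whether a model fails at reading the metadata, at reasoning, or at the classification itself.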

Key facts

  • AstroAlertBench is a multimodal benchmark for LLMs in astronomical classification.
  • It uses 1,500 real alerts from the Zwicky Transient Facility (ZTF).
  • The benchmark evaluates metadata grounding, scientific reasoning, and hierarchical classification into five categories.
  • Thirteen frontier LLMs (closed-source and open-weight) were tested.
  • Results show that even the most capable models struggle with specialized scientific classification.

Entities

Institutions

  • Zwicky Transient Facility (ZTF)

Sources