ARTFEED — Contemporary Art Intelligence

Collider-Bench: AI Agents Tested on Particle Physics Reproductions

ai-technology · 2026-05-16

Researchers have introduced Collider-Bench, a benchmark that tests whether large language model (LLM) agents can reproduce experimental analyses from the Large Hadron Collider (LHC). Each task challenges an agent to turn a published analysis paper into a working simulation-and-selection pipeline, using only publicly available papers and open scientific software. Reproducing LHC analyses is hard because public toolchains only approximate the collaborations' internal software and published papers omit implementation details needed for a faithful reconstruction, so agents must bridge the gaps with physical reasoning, domain expertise, and trial-and-error. For each task, agents submit predicted collision event yields in designated signal regions. The benchmark is introduced in arXiv:2605.13950.

Key facts

  • Collider-Bench is a benchmark for LLM agents on long-horizon tool-use tasks in particle physics.
  • It requires reproducing LHC experimental analyses from public papers and open software.
  • Public toolchains only approximate internal software used by experimental collaborations.
  • Published papers omit implementation details needed for faithful reconstruction.
  • Agents must use physical reasoning, domain knowledge, and trial-and-error.
  • Each task involves turning a published analysis into an executable pipeline.
  • Agents submit predicted collision event yields in specified signal regions.
  • The benchmark is introduced in arXiv:2605.13950.
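To make the task concrete: a minimal sketch of the kind of pipeline an agent must build, in Python. Everything here is hypothetical and illustrative, not from the paper: the toy event generator stands in for real Monte Carlo tools, the cut values define an invented signal region, and the cross-section and luminosity numbers are placeholders. Only the overall shape (simulate events, apply signal-region cuts, scale the selection efficiency to a predicted yield) reflects what the benchmark asks for.

```python
import random

def generate_toy_events(n, seed=0):
    """Toy stand-in for a Monte Carlo event generator (illustrative only)."""
    rng = random.Random(seed)
    return [
        {
            "met": rng.expovariate(1 / 120.0),         # missing transverse energy [GeV]
            "n_jets": rng.randint(0, 8),               # jet multiplicity
            "lead_jet_pt": rng.expovariate(1 / 90.0),  # leading-jet pT [GeV]
        }
        for _ in range(n)
    ]

def passes_signal_region(event):
    """Hypothetical signal-region cuts, as an agent might transcribe them
    from a published analysis paper."""
    return (
        event["met"] > 200.0
        and event["n_jets"] >= 4
        and event["lead_jet_pt"] > 100.0
    )

def predicted_yield(events, cross_section_pb, luminosity_ifb):
    """Scale the selection efficiency to an expected event count:
    N = sigma [pb] * L [fb^-1] * 1000 * efficiency."""
    n_pass = sum(1 for e in events if passes_signal_region(e))
    efficiency = n_pass / len(events)
    return cross_section_pb * luminosity_ifb * 1000.0 * efficiency

if __name__ == "__main__":
    events = generate_toy_events(50_000)
    # Placeholder process parameters, not taken from any real analysis.
    print(round(predicted_yield(events, cross_section_pb=0.1, luminosity_ifb=139.0), 1))
```

In a real benchmark task the hard part is upstream of this sketch: choosing generators and detector-simulation settings that approximate the collaboration's internal software, and reconstructing cut definitions the paper leaves implicit.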

Entities

Institutions

  • Large Hadron Collider
  • arXiv

Sources