FactoryBench: New Benchmark Reveals LLMs Struggle with Industrial Machine Understanding

ai-technology · 2026-05-11

Researchers have released FactoryBench, a benchmark that tests how well time-series models and large language models (LLMs) interpret telemetry from industrial robots. Its question-answer pairs are grouped into four causal levels (state, intervention, counterfactual, and decision), following Pearl's causal hierarchy. Answers come in five formats: four structured types are scored deterministically, while free-form responses are graded by a panel of LLM judges whose verdicts are combined by voting.

The team also built a scalable Q&A generation framework based on structured question templates, and collected FactoryWave, a multitask, multivariate sensor dataset recorded from a UR3 cobot and a KUKA KR10 industrial arm. FactoryBench contains over 70,000 Q&A items derived from approximately 15,000 normalized episodes drawn from FactoryWave, AURSAD, and voraus-AD. In a zero-shot evaluation of six frontier LLMs, no model exceeded 50% accuracy on the structured levels or 18% on decision-making, which points to a considerable gap in machine understanding for industrial use.
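The split between deterministic scoring and judge voting can be illustrated with a minimal Python sketch. It is not the benchmark's actual scoring code: the function names, the exact-match rule, and the strict-majority vote are assumptions made for illustration, and the judge verdicts are passed in as plain booleans rather than produced by real model calls.

```python
from collections import Counter


def score_structured(predicted: str, expected: str) -> float:
    """Deterministic score: exact match after trimming and lowercasing (assumed rule)."""
    return 1.0 if predicted.strip().lower() == expected.strip().lower() else 0.0


def score_free_form(judge_votes: list[bool]) -> float:
    """Combine boolean verdicts from several judges by strict majority (assumed rule)."""
    if not judge_votes:
        raise ValueError("at least one judge vote is required")
    counts = Counter(judge_votes)
    # A tie counts as incorrect in this sketch.
    return 1.0 if counts[True] > counts[False] else 0.0
```

With this sketch, `score_free_form([True, True, False])` counts the answer as correct, while an evenly split vote counts as incorrect. A real judge panel could just as well use graded scores or a different tie-breaking rule.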

Key facts

FactoryBench evaluates time-series models and LLMs on industrial robotic telemetry understanding.
Q&A pairs are organized along four causal levels: state, intervention, counterfactual, decision.
Answer formats include four structured types and free-form answers scored by LLM-as-judge.
The FactoryWave dataset was collected from a UR3 cobot and a KUKA KR10 industrial arm.
The benchmark includes over 70,000 Q&A items from approximately 15,000 normalized episodes.
Data sources: FactoryWave, AURSAD, and voraus-AD.
Zero-shot evaluation of six frontier LLMs showed no model exceeded 50% on structured levels.
No model exceeded 18% on decision-making tasks.
Published on arXiv with ID 2605.07675.
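
Template-based Q&A generation, as described above, can be sketched roughly as follows. This is an assumption-laden illustration and not the authors' framework: the `QAItem` type, the `fill_template` helper, and the episode fields `joint` and `peak_current` are all hypothetical.

```python
from dataclasses import dataclass

# The four causal levels named in the article.
LEVELS = ("state", "intervention", "counterfactual", "decision")


@dataclass(frozen=True)
class QAItem:
    level: str
    question: str
    answer: str


def fill_template(level: str, template: str, answer_key: str, episode: dict) -> QAItem:
    """Fill a question template from one episode's fields (hypothetical schema)."""
    if level not in LEVELS:
        raise ValueError(f"unknown causal level: {level}")
    question = template.format(**episode)  # unused episode fields are ignored
    return QAItem(level=level, question=question, answer=str(episode[answer_key]))


# Hypothetical episode summary; the field names are invented for this example.
episode = {"joint": 3, "peak_current": 2.5}
item = fill_template(
    "state",
    "What was the peak current on joint {joint}?",
    "peak_current",
    episode,
)
```

Because each template is paired with a fixed answer field, every generated item carries a ground-truth answer that can be checked without a human grader, which is what makes this style of generation easy to scale.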
