LLMs Tested on Industrial Maintenance via DiagnosticIQ Benchmark
Researchers have introduced DiagnosticIQ, a benchmark of 6,690 expert-validated multiple-choice questions built from 118 rule-action pairs spanning 16 asset categories. It tests whether large language models can translate symbolic maintenance rules into practical actions, a task that normally demands asset-specific expertise in industrial maintenance. A symbolic-to-MCQA pipeline converts each rule into Disjunctive Normal Form and samples distractors by embedding similarity, and five benchmark variants (Pro, Pert, Verbose, Aug, Rationale) probe different failure modes. The authors evaluate 29 LLMs and 4 embedding baselines; a human study with 9 practitioners (mean score 45.0%) suggests that DiagnosticIQ requires specialized knowledge beyond operational context alone. The source is arXiv:2605.08614.
Key facts
- DiagnosticIQ benchmark contains 6,690 expert-validated multiple-choice questions
- Derived from 118 rule-action pairs across 16 asset types
- Evaluates whether LLMs can translate symbolic rules into maintenance actions
- Includes five variant types: Pro, Pert, Verbose, Aug, Rationale
- Tested 29 LLMs and 4 embedding baselines
- Human evaluation with 9 practitioners yielded a mean score of 45.0%
- Uses symbolic-to-MCQA pipeline with Disjunctive Normal Form
- Published on arXiv under ID 2605.08614
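The symbolic-to-MCQA pipeline above can be illustrated with a minimal sketch: a rule condition already in Disjunctive Normal Form is rendered as a question stem, and distractor actions are chosen by embedding similarity to the correct answer so that wrong options are plausible rather than random. All names, the rule encoding, and the toy embeddings below are illustrative assumptions, not the paper's actual implementation.

```python
import math

# Hypothetical rule-action pair: a DNF condition (a list of AND-clauses,
# any one of which triggers the action) plus the correct action.
rule_dnf = [
    ["vibration > 8 mm/s", "bearing temp > 90 C"],  # clause 1
    ["oil particle count high"],                    # clause 2
]
correct_action = "Replace the bearing and flush the lubrication circuit"

# Toy action embeddings; a real pipeline would use a sentence encoder.
action_embeddings = {
    "Replace the bearing and flush the lubrication circuit": [0.9, 0.1, 0.2],
    "Replace the bearing only":                              [0.8, 0.2, 0.1],
    "Flush the lubrication circuit only":                    [0.7, 0.3, 0.2],
    "Recalibrate the pressure sensor":                       [0.1, 0.9, 0.3],
    "Repaint the housing":                                   [0.0, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def build_mcqa(rule_dnf, correct_action, embeddings, n_distractors=3):
    """Render the DNF condition as a question stem and pick the
    distractors most similar to the correct answer."""
    stem = (
        "Observed: "
        + " OR ".join("(" + " AND ".join(c) + ")" for c in rule_dnf)
        + ". What maintenance action should be taken?"
    )
    target = embeddings[correct_action]
    candidates = [a for a in embeddings if a != correct_action]
    # Most-similar first: plausible but wrong options make harder items.
    candidates.sort(key=lambda a: cosine(embeddings[a], target), reverse=True)
    return stem, [correct_action] + candidates[:n_distractors]

stem, options = build_mcqa(rule_dnf, correct_action, action_embeddings)
print(stem)
for opt in options:
    print("-", opt)
```

With these toy vectors the least related action ("Repaint the housing") is dropped, leaving one correct answer and three near-miss distractors, which mirrors the idea of embedding-based distractor sampling described in the summary.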