LLMs Tested on Industrial Maintenance via DiagnosticIQ Benchmark
Researchers have introduced DiagnosticIQ, a benchmark of 6,690 expert-validated multiple-choice questions built from 118 rule-action pairs spanning 16 asset categories. It tests whether large language models can translate symbolic maintenance rules into practical actions, a task that normally demands asset-specific expertise in industrial maintenance. A symbolic-to-MCQA pipeline converts each rule into Disjunctive Normal Form and samples distractors by embedding similarity, and five benchmark variants (Pro, Pert, Verbose, Aug, Rationale) probe different failure modes. The authors evaluate 29 LLMs and 4 embedding baselines; a human study with 9 practitioners (mean score 45.0%) suggests that DiagnosticIQ requires specialized knowledge beyond operational context alone. The source is arXiv:2605.08614.
Key facts
- DiagnosticIQ benchmark contains 6,690 expert-validated multiple-choice questions
- Derived from 118 rule-action pairs across 16 asset types
- Evaluates whether LLMs can translate symbolic rules into maintenance actions
- Includes five variant types: Pro, Pert, Verbose, Aug, Rationale
- Tested 29 LLMs and 4 embedding baselines
- Human evaluation with 9 practitioners yielded a mean score of 45.0%
- Uses symbolic-to-MCQA pipeline with Disjunctive Normal Form
- Published on arXiv under ID 2605.08614
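The symbolic-to-MCQA pipeline above can be illustrated with a minimal sketch: a rule condition already in Disjunctive Normal Form is rendered as a question stem, and distractor actions are chosen by embedding similarity to the correct answer so that wrong options are plausible rather than random. All names, the rule encoding, and the toy embeddings below are illustrative assumptions, not the paper's actual implementation.

```python
import math

# Hypothetical rule-action pair: a DNF condition (a list of AND-clauses,
# any one of which triggers the action) plus the correct action.
rule_dnf = [
    ["vibration > 8 mm/s", "bearing temp > 90 C"],  # clause 1
    ["oil particle count high"],                    # clause 2
]
correct_action = "Replace the bearing and flush the lubrication circuit"

# Toy action embeddings; a real pipeline would use a sentence encoder.
action_embeddings = {
    "Replace the bearing and flush the lubrication circuit": [0.9, 0.1, 0.2],
    "Replace the bearing only":                              [0.8, 0.2, 0.1],
    "Flush the lubrication circuit only":                    [0.7, 0.3, 0.2],
    "Recalibrate the pressure sensor":                       [0.1, 0.9, 0.3],
    "Repaint the housing":                                   [0.0, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def build_mcqa(rule_dnf, correct_action, embeddings, n_distractors=3):
    """Render the DNF condition as a question stem and pick the
    distractors most similar to the correct answer."""
    stem = (
        "Observed: "
        + " OR ".join("(" + " AND ".join(c) + ")" for c in rule_dnf)
        + ". What maintenance action should be taken?"
    )
    target = embeddings[correct_action]
    candidates = [a for a in embeddings if a != correct_action]
    # Most-similar first: plausible but wrong options make harder items.
    candidates.sort(key=lambda a: cosine(embeddings[a], target), reverse=True)
    return stem, [correct_action] + candidates[:n_distractors]

stem, options = build_mcqa(rule_dnf, correct_action, action_embeddings)
print(stem)
for opt in options:
    print("-", opt)
```

With these toy vectors the least related action ("Repaint the housing") is dropped, leaving one correct answer and three near-miss distractors, which mirrors the idea of embedding-based distractor sampling described in the summary.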