ARTFEED — Contemporary Art Intelligence

IndustryBench: New Benchmark Tests LLMs on Industrial Procurement Standards

ai-technology · 2026-05-12

IndustryBench is a new 2,049-item benchmark that evaluates large language models (LLMs) on industrial procurement question answering in Chinese. It is grounded in Chinese national standards (GB/T) and structured industrial product records, and its items are organized into ten industry categories and seven capability dimensions, with difficulty tiers assigned by expert panels. During construction, a search-based external-verification stage discarded 70.3% of LLM-generated candidate items, underscoring the risk of relying on LLMs alone for industrial QA. The evaluation decouples raw correctness, scored by a Qwen3-Max model, from safety-critical consistency, addressing the problem that partially correct answers can mask serious contradictions which aggregate benchmarks overlook. The benchmark also ships item-aligned English, Russian, and Vietnamese translations. The paper is available on arXiv under ID 2605.10267.
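The external-verification step described above can be sketched as a simple filter over LLM-generated candidates. This is a minimal illustration of the general idea, not the paper's pipeline: the function names, the search stub, and the substring-based corroboration check are all hypothetical.

```python
# Illustrative sketch of search-based external verification: LLM-generated
# candidate QA items are kept only if an external source corroborates the
# answer. All names and the matching rule are hypothetical stand-ins.

def verify_against_search(candidate, search_fn):
    """Return True if any retrieved snippet supports the candidate's answer."""
    snippets = search_fn(candidate["question"])
    return any(candidate["answer"] in s for s in snippets)

def filter_candidates(candidates, search_fn):
    """Keep corroborated candidates; also report the fraction rejected."""
    kept = [c for c in candidates if verify_against_search(c, search_fn)]
    rejected_rate = 1 - len(kept) / len(candidates) if candidates else 0.0
    return kept, rejected_rate

# Toy usage: a stub "search engine" that knows a single product record.
corpus = {"What is the rated voltage of Part X?": ["Part X rated voltage: 380 V"]}
cands = [
    {"question": "What is the rated voltage of Part X?", "answer": "380 V"},
    {"question": "What is the rated voltage of Part X?", "answer": "220 V"},
]
kept, rate = filter_candidates(cands, lambda q: corpus.get(q, []))
# kept contains only the corroborated item; rate is the rejection fraction.
```

In the real pipeline such a gate reportedly rejected 70.3% of candidates, which is the article's evidence that unverified LLM generation is unreliable for this domain.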

Key facts

  • IndustryBench is a 2,049-item benchmark for industrial procurement QA in Chinese.
  • It is grounded in Chinese national standards (GB/T) and structured industrial product records.
  • The benchmark covers seven capability dimensions, ten industry categories, and panel-derived difficulty tiers.
  • The construction pipeline rejected 70.3% of LLM-generated candidates at a search-based external-verification stage.
  • Evaluation decouples raw correctness from safety-critical consistency.
  • Raw correctness is scored by a Qwen3-Max model.
  • The benchmark includes item-aligned English, Russian, and Vietnamese renderings.
  • The paper is published on arXiv with ID 2605.10267.
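The decoupling of raw correctness from safety-critical consistency might look like the two-metric report below. Only the two-metric idea comes from the article; the scoring rules are illustrative stand-ins (the benchmark itself uses a Qwen3-Max judge for correctness, not exact match).

```python
# Hypothetical sketch of reporting raw correctness and safety-critical
# consistency as separate metrics. Scoring rules here are toy stand-ins.

def raw_correct(pred, gold):
    # Stand-in for an LLM judge: normalized exact match.
    return pred.strip().lower() == gold.strip().lower()

def consistent(pred, safety):
    # An answer is inconsistent if it asserts anything a safety-critical
    # constraint forbids; "forbidden" strings model such contradictions.
    return not any(bad in pred for bad in safety.get("forbidden", []))

def evaluate(items):
    """Return accuracy and consistency as independent scores."""
    n = len(items)
    acc = sum(raw_correct(i["pred"], i["gold"]) for i in items) / n
    cons = sum(consistent(i["pred"], i["safety"]) for i in items) / n
    return {"raw_accuracy": acc, "safety_consistency": cons}

items = [
    {"pred": "IP65", "gold": "IP65", "safety": {"forbidden": []}},
    {"pred": "use below 1000 V", "gold": "use below 500 V",
     "safety": {"forbidden": ["1000 V"]}},
]
report = evaluate(items)
```

Keeping the two numbers separate is the point: an aggregate accuracy score would hide that the second answer is not merely wrong but contradicts a safety constraint.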

Entities

Institutions

  • arXiv

Sources