OR-Space: A Benchmark for Industrial Optimization Agents
Researchers have introduced OR-Space, a benchmark that assesses large language model (LLM) agents across the full lifecycle of industrial optimization tasks. Existing benchmarks largely reduce evaluation to one-shot translation of a self-contained problem statement into a mathematical model. OR-Space instead captures the persistent multi-artifact workspaces and multi-stage task lifecycles found in practice. Each instance is an executable workspace whose interconnected files include business documents, structured data, code artifacts, solver outputs, and task-specific evaluators. The benchmark defines three task modes: Build (creating solver-ready models from diverse artifacts), Revise (updating existing models), and Grounded Explanation (justifying model decisions). The goal is to bridge the gap between academic evaluation and real industrial workflows, where optimization problems change over time and require ongoing refinement. The paper is available on arXiv under the identifier 2605.28158.
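To make the workspace idea concrete, here is a minimal Python sketch of how one instance could be represented. It is an illustration only, not the benchmark's actual schema: the class names (`Workspace`, `TaskMode`, `ArtifactKind`), the file paths, and the `missing_kinds` check are all hypothetical, and the assumption that every instance carries all five artifact kinds is a simplification of the description above.

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskMode(Enum):
    """The three task modes named in the summary."""
    BUILD = "build"
    REVISE = "revise"
    GROUNDED_EXPLANATION = "grounded_explanation"


class ArtifactKind(Enum):
    """Artifact categories listed in the summary (enum names are hypothetical)."""
    BUSINESS_DOCUMENT = "business_document"
    STRUCTURED_DATA = "structured_data"
    CODE = "code"
    SOLVER_OUTPUT = "solver_output"
    EVALUATOR = "evaluator"


@dataclass
class Workspace:
    """Hypothetical in-memory view of one benchmark instance."""
    name: str
    mode: TaskMode
    # Maps a relative file path to the kind of artifact stored there.
    artifacts: dict[str, ArtifactKind] = field(default_factory=dict)

    def missing_kinds(self) -> set[ArtifactKind]:
        """Return the artifact kinds that no file in the workspace provides."""
        return set(ArtifactKind) - set(self.artifacts.values())


# Illustrative instance; the paths and the case name are invented.
example = Workspace(
    name="hypothetical-scheduling-case",
    mode=TaskMode.REVISE,
    artifacts={
        "docs/requirements.md": ArtifactKind.BUSINESS_DOCUMENT,
        "data/orders.csv": ArtifactKind.STRUCTURED_DATA,
        "model/schedule.py": ArtifactKind.CODE,
        "runs/last_solve.log": ArtifactKind.SOLVER_OUTPUT,
    },
)

# This workspace has no evaluator file yet, so one kind is reported missing.
print(example.missing_kinds())
```

Keeping the task mode on the workspace reflects the point that the same set of files can be approached in different ways: an agent in Revise mode starts from an existing model, whereas one in Build mode would have to create it from the other artifacts.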
Key facts
- OR-Space is a benchmark for industrial optimization agents.
- It evaluates LLM agents across model construction, revision, and grounded explanation.
- Each instance is an executable workspace with business documents, structured data, code artifacts, solver outputs, and evaluators.
- Three task modes: Build, Revise, and Grounded Explanation.
- Existing benchmarks reduce evaluation to one-shot translation from self-contained problem statements.
- OR-Space captures persistent multi-artifact workspaces and multi-stage task lifecycles.
- Published on arXiv with identifier 2605.28158.
- The benchmark aims to bridge the gap between academic evaluation and real industrial workflows.
Entities
Institutions
- arXiv