FrontierOR Benchmark Tests LLMs on Large-Scale Optimization

publication · 2026-05-26

Researchers have introduced FrontierOR, a benchmark that assesses how well large language models (LLMs) can design efficient algorithms for complex, large-scale optimization problems. The benchmark comprises 180 tasks drawn from papers in leading operations research venues, each accompanied by standardized instances and a hidden, expert-verified evaluation suite. Seven LLMs, spanning frontier, cost-effective, and open-source models, were tested in both one-shot and test-time evolution settings. The results indicate that current LLMs struggle with scalable algorithm design and frequently underperform direct formulation-and-solve approaches. FrontierOR aims to advance LLM performance in operations research by emphasizing scalability and the exploitation of problem structure, addressing a shortcoming of existing benchmarks, which focus on small or simplified cases.

Key facts

FrontierOR is among the first benchmarks to systematically evaluate LLM-based efficient algorithm design for realistic large-scale optimization problems.
The benchmark includes 180 tasks derived from methodologically diverse papers published in top-tier operations research venues.
Each task has standardized instances and a hidden, expert-verified evaluation suite.
Seven LLMs were evaluated, spanning frontier, cost-effective, and open-source models.
Evaluation was conducted in both one-shot and test-time evolution settings.
Results reveal that current LLMs struggle with scalable algorithm design.
Existing benchmarks are limited to small or simplified examples far below real-world scale and complexity.
The work is published on arXiv with ID 2605.25246.
